Section of Integrative Biology, Institute for Cellular and Molecular Biology, Center for Computational Biology and Bioinformatics, University of Texas at Austin, Austin, TX, USA.
Mol Biol Evol. 2013 Jan;30(1):36-44. doi: 10.1093/molbev/mss217. Epub 2012 Sep 12.
We present a novel method to identify sites under selection in protein-coding genes. Our method combines the traditional Goldman-Yang model of coding-sequence evolution with the information obtained from the 3D structure of the evolving protein, specifically the relative solvent accessibility (RSA) of individual residues. We develop a random-effects likelihood sites model in which rate classes are RSA dependent. The RSA dependence is modeled with linear functions. We demonstrate that our RSA-dependent model provides a significantly better fit to molecular sequence data than does a traditional, RSA-independent model. We further show that our model provides a natural, RSA-dependent neutral baseline for the evolutionary rate ratio ω = dN/dS. Sites that deviate from this neutral baseline likely experience selection pressure for function. We apply our method to the influenza proteins hemagglutinin and neuraminidase. For hemagglutinin, our method recovers positively selected sites near the sialic acid-binding site and negatively selected sites that may be important for trimerization. For neuraminidase, our method recovers the oseltamivir resistance site and otherwise suggests that few sites deviate from the neutral baseline. Our method is broadly applicable to any protein sequences for which structural data are available or can be obtained via homology modeling or threading.
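The idea of an RSA-dependent baseline can be illustrated with a minimal sketch. It assumes each rate class has an ω that is a linear function of RSA and that the baseline at a given RSA is the weighted mean over classes. All names, weights, intercepts, slopes, and the tolerance below are hypothetical values chosen for illustration. The sketch does not reproduce the likelihood fitting or the site-level inference of the method described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateClass:
    """One rate class whose omega depends linearly on RSA (all values hypothetical)."""
    weight: float     # assumed prior probability of the class
    intercept: float  # assumed omega at RSA = 0 (fully buried residue)
    slope: float      # assumed change in omega per unit of RSA

    def omega(self, rsa: float) -> float:
        # Linear RSA dependence, clipped at zero because dN/dS cannot be negative.
        return max(0.0, self.intercept + self.slope * rsa)


def baseline_omega(classes, rsa):
    """Weighted mean omega over all rate classes at the given RSA."""
    total_weight = sum(c.weight for c in classes)
    return sum(c.weight * c.omega(rsa) for c in classes) / total_weight


def classify_site(site_omega, rsa, classes, tolerance=0.5):
    """Compare a per-site omega estimate with the RSA-dependent baseline.

    The fixed tolerance is a stand-in for a proper statistical test.
    """
    base = baseline_omega(classes, rsa)
    if site_omega > base + tolerance:
        return "above baseline"
    if site_omega < base - tolerance:
        return "below baseline"
    return "near baseline"


# Two illustrative classes: one strongly constrained, one more permissive.
CLASSES = [
    RateClass(weight=0.7, intercept=0.05, slope=0.2),
    RateClass(weight=0.3, intercept=0.20, slope=1.0),
]

if __name__ == "__main__":
    for rsa in (0.0, 0.5, 1.0):
        print(f"RSA={rsa:.1f}  baseline omega={baseline_omega(CLASSES, rsa):.3f}")
    print(classify_site(2.0, 1.0, CLASSES))
```

In this toy setup the baseline rises with RSA, so the same per-site ω can count as elevated for a buried residue but unremarkable for an exposed one.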