Zhao Zi-Ming, Campbell Michael C, Li Ning, Lee Daniel S W, Zhang Zhang, Townsend Jeffrey P
Department of Biostatistics, Yale University, New Haven, CT.
Department of Biology, Howard University, Washington, DC.
Mol Biol Evol. 2017 Nov 1;34(11):3006-3022. doi: 10.1093/molbev/msx213.
Numerous approaches have been developed to infer natural selection based on the comparison of polymorphism within species and divergence between species. These methods are especially powerful for the detection of uniform selection operating across a gene. However, empirical analyses have demonstrated that regions of protein-coding genes exhibiting clusters of amino acid substitutions are subject to different levels of selection relative to other regions of the same gene. To quantify this heterogeneity of selection within coding sequences, we developed Model Averaged Site Selection via Poisson Random Field (MASS-PRF). MASS-PRF identifies an ensemble of intragenic clustering models for polymorphic and divergent sites. This ensemble of models is used within the Poisson Random Field framework to estimate selection intensity on a site-by-site basis. Using simulations, we demonstrate that MASS-PRF has high power to detect clusters of amino acid variants in small genic regions, can reliably estimate the probability of a variant occurring at each nucleotide site in sequence data, and is robust to historical demographic trends and recombination. We applied MASS-PRF to human gene polymorphism data derived from the 1000 Genomes Project and to divergence data from the common chimpanzee. On the basis of this analysis, we discovered striking regional variation in selection intensity, indicative of positive or negative selection, in well-defined domains of genes that have previously been associated with neurological processing, immunity, and reproduction. We suggest that amino acid-altering substitutions within these regions are, or have been, selectively advantageous in the human lineage and play important roles in protein function.
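The model-averaging step described above can be illustrated with a minimal sketch. It is not the MASS-PRF implementation: it assumes, for illustration only, that each candidate clustering model supplies a per-site variant probability and an AIC score, and that models are combined with Akaike weights. The function names `akaike_weights` and `model_averaged_site_rates` and the input layout are hypothetical.

```python
import math


def akaike_weights(aic_scores):
    """Convert AIC scores into normalized model weights (assumed weighting scheme)."""
    best = min(aic_scores)
    # Subtract the best score first so the exponentials stay numerically stable.
    raw = [math.exp(-0.5 * (aic - best)) for aic in aic_scores]
    total = sum(raw)
    return [r / total for r in raw]


def model_averaged_site_rates(models):
    """Average per-site variant probabilities across candidate clustering models.

    `models` is a list of (aic, per_site_rates) pairs; every per_site_rates
    list must have the same length (one entry per site in the gene).
    """
    weights = akaike_weights([aic for aic, _ in models])
    n_sites = len(models[0][1])
    if any(len(rates) != n_sites for _, rates in models):
        raise ValueError("all models must cover the same number of sites")
    return [
        sum(w * rates[i] for w, (_, rates) in zip(weights, models))
        for i in range(n_sites)
    ]


# Hypothetical example: two clustering models for a four-site region.
candidate_models = [
    (100.0, [0.10, 0.10, 0.40, 0.40]),  # one cluster boundary after site 2
    (102.0, [0.25, 0.25, 0.25, 0.25]),  # no clustering (uniform rate)
]
averaged = model_averaged_site_rates(candidate_models)
```

Averaging over models rather than selecting a single best one lets each site's estimate reflect uncertainty about where cluster boundaries lie. In a PRF-style analysis, such per-site probabilities for polymorphism and divergence would then feed the estimation of selection intensity at each site.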