Liu Li, Tamura Koichiro, Sanderford Maxwell, Gray Vanessa E, Kumar Sudhir
Department of Biomedical Informatics, Arizona State University, Scottsdale; Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia.
Department of Biological Sciences, Tokyo Metropolitan University, Hachioji, Tokyo, Japan.
Mol Biol Evol. 2016 Jan;33(1):245-54. doi: 10.1093/molbev/msv198. Epub 2015 Oct 13.
Widespread sequencing efforts are revealing an unprecedented amount of genomic variation in populations. Such information is routinely used to derive consensus reference sequences and to infer positions subject to natural selection. Here, we present a new molecular evolutionary method for estimating neutral evolutionary probabilities (EPs) of each amino acid or nucleotide state at a genomic position without using intraspecific polymorphism data. Because EPs are derived independently of population-level information, they serve as null expectations that can be used to evaluate selective forces on alleles at both polymorphic and monomorphic positions in populations. We applied this method to coding sequences in the human genome and produced a comprehensive evolutionary variome reference for all human proteins. We found that EPs accurately predict neutral and disease-associated alleles. Through an analysis of discordance between allelic EPs and their observed population frequencies, we discovered thousands of novel candidate sites for nonneutral evolution in human proteins. Many of these were validated in a joint analysis of disease-associated variants and population data. The EP method is also directly applicable to the analysis of noncoding sequences and genomic analyses of nonmodel species.