Shauli Tair, Brandes Nadav, Linial Michal
The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, 91904, Israel.
Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, 91904, Israel.
NAR Genom Bioinform. 2021 Sep 16;3(3):lqab079. doi: 10.1093/nargab/lqab079. eCollection 2021 Sep.
Human genetic variation in coding regions is fundamental to the study of protein structure and function. Most methods for interpreting missense variants rely on substitution measures derived from homologous proteins across different species. In this study, we introduce human-specific amino acid (AA) substitution matrices that are based on genetic variation in the modern human population. We analyzed the frequencies of >4.8M single nucleotide variants (SNVs) at codon and AA resolution and compiled human-centric substitution matrices that are fundamentally different from classic cross-species matrices (e.g. BLOSUM, PAM). Our matrices are asymmetric, with some AA replacements showing significant directional preference. Moreover, these AA matrices are only partly predicted by nucleotide substitution rates. We further tested the utility of our matrices in exposing functional signals of experimentally validated protein annotations. A significant reduction in AA transition frequencies was observed across nine post-translational modification (PTM) types and four ion-binding sites. Our results suggest a purifying selection signal in the human proteome across a diverse set of functional protein annotations and provide an empirical baseline for interpreting human genetic variation in coding regions.
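The idea of an asymmetric, population-derived substitution matrix can be illustrated with a minimal sketch. It is not the authors' pipeline: the row normalisation, the function names `substitution_matrix` and `asymmetry`, and the toy variant list are all assumptions made for illustration, and the counts are invented rather than taken from the study.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def substitution_matrix(variants):
    """Return row-normalised frequencies of ref -> alt amino acid changes.

    `variants` is an iterable of (ref_aa, alt_aa) pairs, one per observed
    missense variant. Row normalisation is an assumption of this sketch.
    """
    # Count each directed replacement; synonymous pairs are ignored.
    counts = Counter((ref, alt) for ref, alt in variants if ref != alt)
    matrix = {}
    for ref in AMINO_ACIDS:
        row_total = sum(counts[(ref, alt)] for alt in AMINO_ACIDS)
        matrix[ref] = {
            alt: (counts[(ref, alt)] / row_total if row_total else 0.0)
            for alt in AMINO_ACIDS
        }
    return matrix


def asymmetry(matrix, a, b):
    """Difference between the a -> b and b -> a frequencies (0 if symmetric)."""
    return matrix[a][b] - matrix[b][a]


# Hypothetical toy observations, not data from the study.
toy_variants = [("R", "H")] * 3 + [("R", "Q")] + [("H", "R")] + [("H", "Y")]
m = substitution_matrix(toy_variants)
print(m["R"]["H"], m["H"]["R"], asymmetry(m, "R", "H"))
```

Because each row is normalised by the reference amino acid, `m[a][b]` and `m[b][a]` need not be equal, which is the kind of directional preference the abstract describes. A symmetric cross-species matrix such as BLOSUM would give an asymmetry of zero for every pair.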