Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232, USA.
Department of Computer Science, Vanderbilt University, Nashville, TN 37212, USA.
Am J Hum Genet. 2018 Mar 1;102(3):415-426. doi: 10.1016/j.ajhg.2018.01.017. Epub 2018 Feb 15.
The spatial distribution of genetic variation within proteins is shaped by evolutionary constraint and provides insight into the functional importance of protein regions and the potential pathogenicity of protein alterations. Here, we comprehensively evaluate the 3D spatial patterns of human germline and somatic variation in 6,604 experimentally derived protein structures and 33,144 computationally derived homology models covering 77% of all human proteins. Using a systematic approach, we quantify differences in the spatial distributions of neutral germline variants, disease-causing germline variants, and recurrent somatic variants. Neutral missense variants exhibit a general trend toward spatial dispersion, which is driven by constraint on core residues. In contrast, germline disease-causing variants are generally clustered in protein structures and form clusters more frequently than recurrent somatic variants identified from tumor sequencing. In total, we identify 215 proteins with significant spatial constraints on the distribution of disease-causing missense variants in experimentally derived protein structures, only 65 (30%) of which have been previously reported. This analysis identifies many clusters not detectable from sequence information alone; only 12% of proteins with significant clustering in 3D were identified from similar analyses of linear protein sequence. Furthermore, spatial analyses of mutations in homology-based structural models are highly correlated with those from experimentally derived structures, supporting the use of computationally derived models. Our approach highlights significant differences in the spatial constraints on different classes of mutations in protein structure and identifies regions of potential function within individual proteins.
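As an illustration of what quantifying clustering versus dispersion in 3D can look like, the sketch below runs a simple permutation test on the mean pairwise distance between variant residues. The abstract does not specify the statistic used, so this is a stand-in under stated assumptions rather than the authors' method: the function names, the choice of mean pairwise distance, and the use of one coordinate per residue (for example the C-alpha atom) are all hypothetical.

```python
import numpy as np


def mean_pairwise_distance(coords):
    """Mean Euclidean distance over all unordered pairs of points."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(coords), k=1)  # each pair counted once
    return dists[upper].mean()


def spatial_permutation_test(residue_coords, variant_idx, n_perm=999, seed=0):
    """Compare variant residues against random residue subsets of equal size.

    residue_coords: (n_residues, 3) array, one coordinate per residue
                    (assumption: e.g. C-alpha positions in angstroms).
    variant_idx:    indices of residues that carry a variant.
    Returns (observed, z, p_clustered, p_dispersed). A negative z means the
    variants sit closer together than expected by chance; a positive z means
    they are more spread out.
    """
    rng = np.random.default_rng(seed)
    residue_coords = np.asarray(residue_coords, dtype=float)
    variant_idx = np.asarray(variant_idx)
    n, k = len(residue_coords), len(variant_idx)

    observed = mean_pairwise_distance(residue_coords[variant_idx])

    # Null distribution: the same number of residues drawn without replacement.
    null = np.empty(n_perm)
    for i in range(n_perm):
        sample = rng.choice(n, size=k, replace=False)
        null[i] = mean_pairwise_distance(residue_coords[sample])

    z = (observed - null.mean()) / null.std()
    # One-sided empirical p-values with the usual +1 correction.
    p_clustered = (1 + np.sum(null <= observed)) / (n_perm + 1)
    p_dispersed = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, z, p_clustered, p_dispersed
```

In this framing, a set of disease-causing variants with a small `p_clustered` would count as spatially clustered, while neutral variants that avoid the protein core would tend toward a small `p_dispersed`. A real analysis would also need to handle multiple chains, residues missing from the structure, and multiple-testing correction across proteins.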