Evans Patrick, Cox Nancy J, Gamazon Eric R
Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, United States of America.
Clare Hall, University of Cambridge, Cambridge, United Kingdom.
PeerJ. 2020 Jul 21;8:e9554. doi: 10.7717/peerj.9554. eCollection 2020.
The development of explanatory models of protein sequence evolution has broad implications for our understanding of cellular biology, population history, and disease etiology. Here we analyze the GTEx transcriptome resource to quantify the effect of the transcriptome on protein sequence evolution in a multi-tissue framework. We find substantial variation among the central nervous system tissues in the effect of expression variance on evolutionary rate, with highly variable genes in the cortex showing significantly greater purifying selection than highly variable genes in subcortical regions (Mann-Whitney U test, P = 1.4 × 10). The remaining tissues cluster by their observed correlation between expression and evolutionary rate, enabling evolutionary analysis of genes in diverse physiological systems, including the digestive, reproductive, and immune systems. Importantly, the tissue in which a gene attains its maximum expression variance varies significantly with evolutionary rate (P = 5.55 × 10), suggesting a tissue-anchored model of protein sequence evolution. Using a large-scale reference resource, we show that the tissue-anchored model provides a transcriptome-based approach to predicting the primary affected tissue of developmental disorders. When gradient boosted regression trees are used to model evolutionary rate under a range of model parameters, the selected features explain up to 62% of the variation in evolutionary rate and provide additional support for the tissue model. Finally, we investigate several methodological implications, including the importance of evolutionary-rate-aware models that impute gene expression from genetic data for improving the search for disease-associated genes in transcriptome-wide association studies. Collectively, this study presents a comprehensive transcriptome-based analysis of a range of factors that may constrain molecular evolution and proposes a novel framework for the study of gene function and disease mechanism.