Falgarone Théo, Villain Etienne, Richard Francois, Osmanli Zarifa, Kajava Andrey V
Centre de Recherche en Biologie cellulaire de Montpellier, CNRS, Université Montpellier, Montpellier, 34293, France.
Biophysics Institute, Ministry of Science and Education of Azerbaijan Republic, Az1141, Baku, Azerbaijan.
Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad183.
Loss of solubility usually leads to a detrimental loss of protein function. In some cases, however, protein aggregation is required for beneficial functions. Given this duality, how natural selection controls aggregation remains a fundamental question. The exponential growth of genomic sequence data and recent progress in in silico aggregation predictors allow this problem to be approached through large-scale bioinformatics analysis. Most aggregation-prone regions are hidden within the 3D structure, which makes them inaccessible to the intermolecular interactions responsible for aggregation. Thus, the most realistic census of aggregation-prone regions requires crossing aggregation predictions with information about the location of natively unfolded regions. This allows us to detect so-called 'exposed aggregation-prone regions' (EARs). Here, we analyzed the occurrence and distribution of EARs in 76 reference proteomes from the three kingdoms of life. For this purpose, we used a bioinformatics pipeline that provides a consensus result based on several aggregation predictors. Our analysis revealed a number of new statistically significant correlations concerning the presence of EARs in different organisms and their dependence on protein length, cellular localization, co-occurrence with short linear motifs and the level of protein expression. We also obtained a list of proteins with conserved aggregation-prone sequences for further experimental tests. Insights gained from this work led to a deeper understanding of the relationship between protein evolution and aggregation.
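The idea of crossing a consensus aggregation prediction with natively unfolded regions can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the per-residue boolean input format, the vote threshold `min_votes` and the minimum region length `min_len` are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence, Tuple


def consensus_mask(predictions: Sequence[Sequence[bool]], min_votes: int) -> List[bool]:
    """Per-residue consensus: True where at least `min_votes` predictors agree.

    `predictions` holds one boolean track per (hypothetical) aggregation predictor.
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictor outputs must cover the same sequence length")
    return [sum(p[i] for p in predictions) >= min_votes for i in range(length)]


def find_ears(
    aggregation: Sequence[bool],
    unfolded: Sequence[bool],
    min_len: int = 5,  # assumed cutoff, not taken from the source
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs that are aggregation-prone AND unfolded."""
    if len(aggregation) != len(unfolded):
        raise ValueError("both tracks must cover the same sequence length")
    exposed = [a and u for a, u in zip(aggregation, unfolded)]
    regions: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(exposed + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


# Toy 12-residue example with three hypothetical predictors.
predictor_a = [False] * 2 + [True] * 6 + [False] * 4
predictor_b = [False] * 3 + [True] * 6 + [False] * 3
predictor_c = [False] * 12
unfolded = [False] * 4 + [True] * 8  # residues 4..11 assumed natively unfolded

consensus = consensus_mask([predictor_a, predictor_b, predictor_c], min_votes=2)
ears = find_ears(consensus, unfolded, min_len=4)
```

In this toy case the consensus covers residues 3 to 7, but residue 3 lies in the structured part, so only the stretch overlapping the unfolded region is kept as a candidate EAR. Real predictors typically emit scores rather than booleans, so a thresholding step would be needed before this one.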