A sequence-based foldability score combined with AlphaFold2 predictions to disentangle the protein order/disorder continuum.

Affiliation

Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France.

Publication information

Proteins. 2023 Apr;91(4):466-484. doi: 10.1002/prot.26441. Epub 2022 Nov 9.

DOI:10.1002/prot.26441

PMID:36306150

Abstract

Order and disorder govern protein functions, but disorder is highly diverse, ranging from regions that are, and stay, fully disordered to regions of conditional order. This diversity remains difficult to decipher even though it is encoded in the amino acid sequences. Here, we developed an analytic Python package, named pyHCA, to estimate the foldability of a protein segment from its amino acid sequence alone, based on a measure of its density in regular secondary structures associated with hydrophobic clusters, as defined by the hydrophobic cluster analysis (HCA) approach. The tool was designed by optimizing the separation between foldable segments from databases of disorder (DisProt) and order (SCOPe [soluble domains] and OPM [transmembrane domains]). It makes it possible to specify the ratio between order, embodied by regular secondary structures (either participating in the hydrophobic core of well-folded 3D structures or conditionally formed in intrinsically disordered regions), and disorder. We illustrated the relevance of pyHCA with several examples and applied it to the proteome sequences of 21 species, ranging from prokaryotes and archaea to unicellular and multicellular eukaryotes, for which structure models are provided in the AlphaFold Protein Structure Database. Cases of low-confidence scores related to disorder were distinguished from those of sequences that we identified as foldable but that are still excluded from accurate modeling by AlphaFold2 because of a lack of sequence homologs or compositional biases. Overall, our approach is complementary to AlphaFold2, providing guides to map structural innovations through evolutionary processes at proteome and gene scales.
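To make the idea of a sequence-only "density in hydrophobic clusters" more concrete, the following is a minimal illustrative sketch. It is not the pyHCA implementation or its API, and it does not reproduce the published foldability score. It assumes a simplified, linear reading of HCA: the strong hydrophobic alphabet VILFMYW, clusters split by four or more consecutive non-hydrophobic residues, and proline treated as a cluster breaker. The function names `hydrophobic_clusters` and `cluster_density`, and the `min_hydrophobic` threshold, are hypothetical and chosen only for illustration.

```python
# Toy, linear approximation of hydrophobic cluster delineation.
# Assumptions (not taken from pyHCA): alphabet, gap threshold, and scoring rule.
HYDROPHOBIC = set("VILFMYW")  # strong hydrophobic residues commonly used in HCA
MAX_GAP = 3  # assumption: 4 or more non-hydrophobic residues in a row split clusters


def hydrophobic_clusters(seq):
    """Return (start, end) index pairs (end exclusive) of toy hydrophobic clusters."""
    seq = seq.upper()
    clusters = []
    start = last = None  # first and most recent hydrophobic position of the open cluster
    for i, aa in enumerate(seq):
        if aa == "P":
            # Assumption: proline always closes the cluster that is currently open.
            if start is not None:
                clusters.append((start, last + 1))
            start = last = None
        elif aa in HYDROPHOBIC:
            if start is None:
                start = i
            elif i - last - 1 > MAX_GAP:
                # Gap too long: close the open cluster and start a new one here.
                clusters.append((start, last + 1))
                start = i
            last = i
    if start is not None:
        clusters.append((start, last + 1))
    return clusters


def cluster_density(seq, min_hydrophobic=2):
    """Fraction of residues covered by clusters with enough hydrophobic residues."""
    seq = seq.upper()
    if not seq:
        return 0.0
    covered = 0
    for start, end in hydrophobic_clusters(seq):
        n_hydrophobic = sum(aa in HYDROPHOBIC for aa in seq[start:end])
        if n_hydrophobic >= min_hydrophobic:
            covered += end - start
    return covered / len(seq)


if __name__ == "__main__":
    example = "VLAAIGGGGGGL"  # made-up sequence, for illustration only
    print(hydrophobic_clusters(example))
    print(round(cluster_density(example), 3))
```

In this toy scheme, a segment rich in multi-residue hydrophobic clusters gets a density near 1, while a segment with isolated or no hydrophobic residues gets a density near 0. The actual HCA approach works on a two-dimensional helical representation of the sequence, and the pyHCA score was optimized against DisProt, SCOPe, and OPM, so values from this sketch are not comparable to the published ones.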
