Laboratorio de Fisiología de Proteínas, Facultad de Ciencias Exactas y Naturales, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires, Argentina.
Departamento de Matemática, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina.
PLoS One. 2021 May 3;16(5):e0248841. doi: 10.1371/journal.pone.0248841. eCollection 2021.
Linear motifs are short protein subsequences that mediate protein interactions. Hundreds of motif classes, comprising thousands of motif instances, are known. Our theory estimates how many motif classes remain undiscovered. As is commonly done, we describe motif classes as regular expressions that specify the motif length and the amino acids allowed at each motif position. We measure motif specificity for a pair of motif classes by counting the motif-discriminating positions that prevent a protein subsequence from matching both classes at once. For a given amino acid alphabet, we derive theorems for the maximal number of motif classes that can simultaneously maintain a given number of motif-discriminating positions between all pairs of classes in the motif universe. We also calculate the fraction of all protein subsequences that would belong to a motif class if all potential motif classes came into existence. Naturally occurring pairs of motif classes most often present a single motif-discriminating position. This mild specificity maximizes the potential number of coexisting motif classes, the expansion of the motif universe due to amino acid modifications, and the fraction of amino acid sequences that code for a motif instance. As a result, thousands of linear motif classes may remain undiscovered.
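The notion of a motif-discriminating position can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: both motif classes have the same length and are compared position by position, a position counts as discriminating when the two sets of allowed amino acids share no residue, and the patterns use only a restricted regular-expression syntax (single letters, `.`, `[...]` and `[^...]`). The function names and the example patterns are hypothetical.

```python
# Illustrative sketch only; names and patterns are hypothetical.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")  # standard 20-letter alphabet


def parse_motif(pattern):
    """Turn a fixed-length pattern into one set of allowed residues per position.

    Supported syntax (an assumption of this sketch): a single letter,
    '.' for any residue, '[ST]' for a set, '[^P]' for a negated set.
    """
    positions = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            end = pattern.index("]", i)
            body = pattern[i + 1:end]
            if body.startswith("^"):
                allowed = AMINO_ACIDS - frozenset(body[1:])
            else:
                allowed = frozenset(body)
            i = end + 1
        elif ch == ".":
            allowed = AMINO_ACIDS
            i += 1
        else:
            allowed = frozenset(ch)
            i += 1
        positions.append(allowed)
    return positions


def discriminating_positions(motif_a, motif_b):
    """Count positions where the two classes allow no amino acid in common."""
    if len(motif_a) != len(motif_b):
        raise ValueError("this sketch only compares motifs of equal length")
    return sum(1 for a, b in zip(motif_a, motif_b) if not (a & b))


def can_match_both(motif_a, motif_b):
    """A subsequence can match both classes only if no position discriminates."""
    return discriminating_positions(motif_a, motif_b) == 0


if __name__ == "__main__":
    first = parse_motif("P.[ST]P")
    second = parse_motif("[RK].L.")
    # Positions 1 and 3 (1-based) have disjoint allowed sets.
    print(discriminating_positions(first, second))
    print(can_match_both(parse_motif("P..P"), first))
```

Under these assumptions, a single discriminating position is already enough to guarantee that no subsequence matches both classes, which is the "mild specificity" the abstract refers to.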