Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, Colorado, United States of America.
PLoS Comput Biol. 2024 May 15;20(5):e1011372. doi: 10.1371/journal.pcbi.1011372. eCollection 2024 May.
Low-complexity domains (LCDs) in proteins are typically enriched in one or two predominant amino acids. As a result, LCDs often exhibit unusual structural/biophysical tendencies and can occupy functional niches. However, for each organism, protein sequences must be compatible with the intracellular biomolecules and the physicochemical environment, both of which vary from organism to organism. This raises the possibility that, in select organisms, LCDs may occupy regions of sequence space that are prohibited in most other organisms. Here, we report a comprehensive survey and functional analysis of LCDs in all known reference proteomes (>21,000 organisms), with added focus on rare and unusual types of LCDs. LCDs were classified according to both the primary amino acid and the secondary amino acid in each LCD sequence, facilitating detailed comparisons of LCD class frequencies across organisms. Examination of LCD classes at different depths (i.e., domain of life, organism, protein, and per-residue levels) reveals unique facets of LCD frequencies and functions. To our surprise, all 400 LCD classes occur in nature, although some are exceptionally rare. A number of rare classes can be defined for each domain of life, with many LCD classes appearing to be eukaryote-specific. Certain LCD classes were consistently associated with identical functions across many organisms, particularly in eukaryotes. Our analysis methods enable simultaneous, direct comparison of all LCD classes between individual organisms, resulting in a proteome-scale view of differences in LCD frequencies and functions. Together, these results highlight the remarkable diversity and functional specificity of LCDs across all known life forms.
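The classification by primary and secondary amino acid can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not state how LCDs are detected or what composition thresholds define a class, so the sketch assumes the LCD sequences have already been extracted and simply ranks residues by count. The function names classify_lcd and class_frequencies are hypothetical, and ties are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

# The 20 standard amino acids; 20 x 20 ordered (primary, secondary)
# pairs gives the 400 possible classes mentioned in the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def classify_lcd(seq):
    """Return (primary, secondary) residues for one LCD sequence.

    Assumption: "primary" is the most frequent residue and "secondary"
    the second most frequent. Secondary is None when the sequence
    contains only one residue type.
    """
    counts = Counter(res for res in seq.upper() if res in AMINO_ACIDS)
    if not counts:
        raise ValueError("sequence contains no standard amino acids")
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    primary = ranked[0][0]
    secondary = ranked[1][0] if len(ranked) > 1 else None
    return primary, secondary


def class_frequencies(lcd_sequences):
    """Count how many LCDs fall into each (primary, secondary) class."""
    return Counter(classify_lcd(seq) for seq in lcd_sequences)
```

Applying class_frequencies to the LCDs of two proteomes would give two tables keyed by the same class labels, which is the kind of per-class comparison between organisms that the abstract describes.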