Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
Swiss Institute for Bioinformatics, University of Lausanne, Lausanne, Switzerland.
Genome Biol. 2023 Jun 8;24(1):135. doi: 10.1186/s13059-023-02973-2.
In every living species, the function of a protein depends on its organization of structural domains, and a protein's length directly reflects that organization. Because each species evolved under different evolutionary pressures, the protein length distribution, much like other genomic features, is expected to vary across species, but it has so far been little studied.
Here we evaluate this diversity by comparing protein length distribution across 2326 species (1688 bacteria, 153 archaea, and 485 eukaryotes). We find that proteins tend to be on average slightly longer in eukaryotes than in bacteria or archaea, but that the variation of length distribution across species is low, especially compared to the variation of other genomic features (genome size, number of proteins, gene length, GC content, isoelectric points of proteins). Moreover, most cases of atypical protein length distribution appear to be due to artifactual gene annotation, suggesting the actual variation of protein length distribution across species is even smaller.
These results open the way for developing a genome annotation quality metric based on protein length distribution to complement conventional quality measures. Overall, our findings show that protein length distribution across living species is more uniform than previously thought. Furthermore, we provide evidence for a universal selection on protein length, yet its mechanism and fitness effect remain intriguing open questions.
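As an illustration of how a length-based annotation check might work, the following minimal sketch flags proteomes whose typical protein length is far from that of the other species. It is not the authors' metric: the function names, the use of the median of log10 lengths, and the threshold of 3 scaled median absolute deviations (MAD) are all assumptions made for this example.

```python
import math
import statistics


def median_log_length(protein_lengths):
    """Median of log10 protein lengths (in amino acids) for one proteome."""
    return statistics.median(math.log10(n) for n in protein_lengths)


def flag_atypical(proteomes, threshold=3.0):
    """Return species whose median log10 protein length lies more than
    `threshold` scaled MADs from the median across all species.

    `proteomes` maps a species name to a list of protein lengths.
    Both the statistic and the threshold are illustrative assumptions.
    """
    medians = {sp: median_log_length(lengths) for sp, lengths in proteomes.items()}
    center = statistics.median(medians.values())
    mad = statistics.median(abs(m - center) for m in medians.values())
    if mad == 0:
        # No spread across species, so nothing can be called atypical.
        return []
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return sorted(
        sp for sp, m in medians.items() if abs(m - center) / scale > threshold
    )
```

A flagged species would only be a candidate for closer inspection of its gene annotation; a real metric would need calibration against proteomes of known quality.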