Chantzi Nikol, Mareboina Manvita, Konnaris Maxwell A, Montgomery Austin, Patsakis Michail, Mouratidis Ioannis, Georgakopoulos-Soares Ilias
Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA.
Department of Statistics, Penn State University, University Park, PA, 16802, USA.
NAR Genom Bioinform. 2024 Apr 4;6(2):lqae029. doi: 10.1093/nargab/lqae029. eCollection 2024 Jun.
The prevalence of short nucleic acid and peptide sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses, to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.
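As a rough illustration of the modelling idea described above (predicting a per-sequence score from mono- and dinucleotide composition and reporting R²), the following minimal sketch may help. It is not the authors' implementation: the abstract does not define the rarity index or the regression family, so the function names (`composition_features`, `fit_linear_model`, `r_squared`), the use of ordinary least squares, and the synthetic target in the demo are all assumptions made for illustration only.

```python
from itertools import product

import numpy as np

ALPHABET = "ACGT"
# All 16 dinucleotides, in a fixed order, used as feature names.
DINUCS = ["".join(p) for p in product(ALPHABET, repeat=2)]


def composition_features(kmer):
    """Return mononucleotide counts (4) followed by overlapping
    dinucleotide counts (16) for one k-mer, as a length-20 vector."""
    mono = [kmer.count(base) for base in ALPHABET]
    di = [
        sum(1 for i in range(len(kmer) - 1) if kmer[i:i + 2] == d)
        for d in DINUCS
    ]
    return np.array(mono + di, dtype=float)


def fit_linear_model(X, y):
    """Ordinary least squares with an intercept (an assumed model choice).
    lstsq returns the minimum-norm solution, so collinear features are tolerated."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def r_squared(X, y, coef):
    """Coefficient of determination of the fitted model on (X, y)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    residuals = y - X1 @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical demo: every 4-mer, with a SYNTHETIC placeholder target.
    # A real analysis would use rarity values computed from genomes instead.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=4)]
    X = np.array([composition_features(k) for k in kmers])
    rng = np.random.default_rng(0)
    y = X[:, 4 + DINUCS.index("CG")] + rng.normal(scale=0.5, size=len(kmers))
    coef = fit_linear_model(X, y)
    print("in-sample R^2:", r_squared(X, y, coef))
```

The same feature construction would carry over to peptides by swapping `ALPHABET` for the 20 amino acids, giving 20 mono- and 400 dipeptide counts; in practice R² would normally be reported on held-out sequences rather than in-sample as in this demo.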