Department of Biomedical Engineering, Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, USA.
Bioinformatics. 2011 Jul 1;27(13):1765-71. doi: 10.1093/bioinformatics/btr275. Epub 2011 May 5.
Accurate prediction of genes encoding small proteins (on the order of 50 amino acids or fewer) remains an open problem in bioinformatics. Some of the best methods for gene prediction use either sequence composition analysis or sequence similarity to a known protein-coding sequence. These methods often fail for small proteins, however, either because few small protein-coding genes have been experimentally verified or because statistics computed on short sequences have limited significance. Our approach is based on the hypothesis that true small proteins will be under selective pressure for encoding a particular amino acid sequence, for ease of translation by the ribosome, and for structural stability. This stability can be achieved either independently or as part of a larger protein complex. Given this assumption, it follows that small proteins should display conserved local protein structure properties much like larger proteins. Our method incorporates neural-network predictions for three local structure alphabets within a comparative genomic approach, using a genomic alignment of 22 closely related bacterial genomes, to predict whether a given open reading frame (ORF) encodes a small protein.
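To make the idea of conserved local structure concrete, the following is a minimal sketch and not the scoring function used in the paper. It assumes that, for one candidate ORF, a local-structure label string has already been predicted in each aligned genome, with `-` marking alignment gaps. The function name `column_agreement`, the label alphabet, and the scoring rule (mean per-column majority fraction) are all illustrative assumptions.

```python
from collections import Counter


def column_agreement(labels_by_genome):
    """Mean fraction of genomes that share the most common label in each column.

    labels_by_genome: equal-length strings, one per aligned genome,
    where '-' marks an alignment gap. Hypothetical input format.
    """
    if not labels_by_genome:
        raise ValueError("need at least one label string")
    length = len(labels_by_genome[0])
    if any(len(s) != length for s in labels_by_genome):
        raise ValueError("aligned label strings must have equal length")

    scores = []
    for column in zip(*labels_by_genome):
        observed = [c for c in column if c != "-"]
        if len(observed) < 2:
            continue  # a single label gives nothing to compare against
        top_count = Counter(observed).most_common(1)[0][1]
        scores.append(top_count / len(observed))
    # Columns with fewer than two labels are skipped; an ORF with no
    # comparable columns scores 0.0.
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    toy = ["HHEE", "HHEE", "HHEC"]  # toy labels, not real predictions
    print(round(column_agreement(toy), 3))
```

A score near 1.0 means the predicted local structure is nearly identical across the aligned genomes. A real pipeline would combine one such signal per structure alphabet and would likely weight genomes by evolutionary distance, which this sketch ignores.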
We applied this method to the complete genome of Escherichia coli strain K12 and evaluated its performance on a set of 60 experimentally verified small proteins from this organism. Out of a total of 11 407 possible ORFs, 6 of the top 10 and 27 of the top 100 predictions belonged to the set of 60 experimentally verified small proteins, and 35 of the 60 fell within the top 200 predictions. We compared our method to Glimmer, using a default Glimmer protocol and a modified small ORF Glimmer protocol with a lower minimum size cutoff. The default Glimmer protocol identified 16 of the true small proteins (all in the top 200 predictions), but made no prediction for 34 because of size cutoffs. The small ORF Glimmer protocol made predictions for all the experimentally verified small proteins, but only 9 of the 60 appeared within its top 200 predictions.
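The evaluation above counts how many verified small proteins appear among the top-ranked predictions. A minimal sketch of that bookkeeping follows; the function names and toy identifiers are hypothetical, and only the figures 60, 200, and 11 407 come from the text.

```python
def hits_in_top_k(ranked_ids, verified_ids, cutoffs):
    """Count verified items among the first k ranked predictions, for each k."""
    verified = set(verified_ids)
    return {k: len(verified.intersection(ranked_ids[:k])) for k in cutoffs}


def expected_random_hits(n_verified, k, n_candidates):
    """Expected number of verified items in the top k under a random ranking."""
    return n_verified * k / n_candidates


if __name__ == "__main__":
    # Toy ranking with hypothetical ORF identifiers, best score first.
    ranked = ["orf7", "orf2", "orf9", "orf4", "orf1", "orf5"]
    verified = ["orf2", "orf4", "orf5"]
    print(hits_in_top_k(ranked, verified, cutoffs=[2, 4, 6]))

    # Baseline for the setting reported in the abstract.
    print(round(expected_random_hits(60, 200, 11407), 2))
```

With the reported figures, a uniformly random ranking would place about one verified small protein in the top 200 (60 × 200 / 11 407 ≈ 1.05), against the 35 reported for the method, which corresponds to a recall of 35/60 ≈ 58% at that cutoff.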