Saeys Yvan, Abeel Thomas, Degroeve Sven, Van de Peer Yves
Department of Plant Systems Biology, VIB, Technologiepark 927, B-9052 Ghent, Belgium.
Bioinformatics. 2007 Jul 1;23(13):i418-23. doi: 10.1093/bioinformatics/btm177.
The correct identification of translation initiation sites (TIS) remains a challenging problem for computational methods. Furthermore, most of these methods focus on identifying TIS in transcript data. In the gene prediction context, however, TIS must be identified at the genomic level, which is harder: many more pseudo-TIS occur in genomic sequence, so models produce a higher number of false positive predictions.
In this article, we evaluate the performance of several 'simple' TIS recognition methods at the genomic level and compare them to state-of-the-art models for TIS prediction in transcript data. We conclude that the simple methods largely outperform the complex ones at the genomic scale, and we propose a new model for TIS recognition at the genome level that combines the strengths of these simple models. The new model obtains a false positive rate of 0.125 at a sensitivity of 0.80 on a well-annotated human chromosome (chromosome 21). Detailed analyses show that the model is useful both on its own and in a simple gene prediction setting.
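To make the reported operating point concrete, the following minimal sketch shows how a "false positive rate at a given sensitivity" figure can be computed from classifier scores. It is an illustration only and not the StartScan implementation: the function name `fpr_at_sensitivity`, the scores, and the labels are all hypothetical, and the toy data were chosen so that the result happens to match the numbers quoted above.

```python
def fpr_at_sensitivity(scores, labels, target=0.80):
    """Return (threshold, sensitivity, fpr) for the strictest score threshold
    whose sensitivity reaches `target`.

    scores: higher means "more likely a true TIS" (assumed convention).
    labels: 1 for a true TIS, 0 for a pseudo-TIS.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Try thresholds from strictest to most permissive.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        sens = tp / n_pos  # sensitivity = TP / (TP + FN)
        if sens >= target:
            return thr, sens, fp / n_neg  # FPR = FP / (FP + TN)

    raise ValueError("target sensitivity cannot be reached")


# Hypothetical toy data: 5 true TIS and 8 pseudo-TIS candidates.
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.65, 0.5, 0.4, 0.3, 0.1, 0.1, 0.05, 0.02]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

thr, sens, fpr = fpr_at_sensitivity(scores, labels, target=0.80)
print(f"threshold={thr} sensitivity={sens:.2f} fpr={fpr:.3f}")
```

With these toy values, four of the five true sites and one of the eight pseudo-sites score at or above the selected threshold. At the genomic level the number of pseudo-TIS candidates is far larger than the number of true sites, so even a modest false positive rate can correspond to many false predictions in absolute terms.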
Datafiles and a web interface for the StartScan program are available at http://bioinformatics.psb.ugent.be/supplementary_data/.