Eden E, Brunak S
Center for Biological Sequence Analysis, Biocentrum-DTU Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark.
Nucleic Acids Res. 2004 Feb 11;32(3):1131-42. doi: 10.1093/nar/gkh273. Print 2004.
Prediction of splice sites in non-coding regions of genes is one of the most challenging aspects of gene structure recognition. We perform a rigorous analysis of such splice sites embedded in human 5' untranslated regions (UTRs), and investigate correlations between this class of splice sites and other features found in the adjacent exons and introns. By restricting the training of neural network algorithms to 'pure' UTRs (not extending partially into protein coding regions), we investigate for the first time the predictive power of the splicing signal proper, in contrast to conventional splice site prediction, which typically relies on the change in sequence at the transition from protein coding to non-coding. As a result, the algorithms were able to pick up subtler splicing signals that were otherwise masked by 'coding' noise, significantly improving the prediction of 5' UTR splice sites. For example, the non-coding splice site predicting networks pick up compositional and positional bias in the 3' ends of non-coding exons and 5' non-coding intron ends, where cytosine and guanine are over-represented. This compositional bias at the true UTR donor sites is also visible in the synaptic weights of the neural networks trained to identify UTR donor sites. Conventional splice site prediction methods perform poorly in UTRs because the reading frame pattern is absent. The NetUTR method presented here performs 2- to 3-fold better than NetGene2 and GenScan in 5' UTRs. We also tested the 5' UTR trained method on protein coding regions and found, surprisingly, that it works quite well (although it cannot compete with NetGene2). This indicates that the local splicing pattern in UTRs and coding regions is largely the same. The NetUTR method is made publicly available at www.cbs.dtu.dk/services/NetUTR.
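As a rough illustration of the compositional bias mentioned above, the following Python sketch scans a sequence for candidate donor sites (GT dinucleotides) and reports the C+G fraction in a fixed window immediately upstream of each one, that is, in the putative 3' end of the exon. The function names, the 10-base window, and the example sequence are assumptions made for illustration only; this is not the NetUTR method, which relies on trained neural networks rather than a simple composition count.

```python
def candidate_donor_sites(seq):
    """Return 0-based positions of every GT dinucleotide in seq.

    GT is the canonical first dinucleotide of an intron, so each hit is
    only a candidate donor site, not a prediction.
    """
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]


def upstream_gc_fraction(seq, pos, window=10):
    """Fraction of C and G among up to `window` bases just upstream of pos.

    The window length is an illustrative assumption, not a value taken
    from the abstract.
    """
    seq = seq.upper()
    upstream = seq[max(0, pos - window):pos]
    if not upstream:
        return 0.0
    return sum(base in "CG" for base in upstream) / len(upstream)


if __name__ == "__main__":
    # Made-up sequence, used only to exercise the functions.
    example = "CCGGCGCGCCGTAAGTATTT"
    for pos in candidate_donor_sites(example):
        print(pos, upstream_gc_fraction(example, pos))
```

A composition count of this kind can only hint at the bias; the abstract attributes the actual predictive gain to networks trained exclusively on 'pure' UTR sequences.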