Wu Jing
Department of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN 47906, USA.
BMC Genomics. 2008 Sep 16;9 Suppl 2(Suppl 2):S13. doi: 10.1186/1471-2164-9-S2-S13.
Computational gene prediction tools routinely generate large volumes of predicted coding exons (putative exons). A common limitation of these tools is their relatively low specificity, which stems from the large number of non-coding regions in the genome.
A statistical approach is developed that substantially improves gene prediction specificity. The key idea is to apply the evolutionary conservation principle to coding exons. First, by exploiting the homology between the genomes of two related species, a probability model is developed for the evolutionary conservation pattern of codons across different genomes. A probability model for the dependency between adjacent codons/triplets is then added to differentiate coding exons from random sequences. Finally, a log odds ratio is used to classify putative exons into a group of coding exons and a group of non-coding regions.
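The log odds classification step can be illustrated with a minimal sketch. This is not the author's model: it assumes a simplified conservation pattern (which of the three positions of an aligned triplet match between the two species), treats triplets as independent (so it omits the adjacent-codon dependency model), and uses toy training data. The function names `conservation_patterns`, `estimate`, and `log_odds` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# All 8 possible match/mismatch patterns over the three triplet positions.
ALL_PATTERNS = list(product((True, False), repeat=3))


def conservation_patterns(seq_a, seq_b):
    """Split two aligned, equal-length sequences into triplets and record,
    for each triplet, which of its three positions match (assumed pattern)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    usable = len(seq_a) - len(seq_a) % 3  # ignore a trailing partial triplet
    return [
        tuple(seq_a[i + k] == seq_b[i + k] for k in range(3))
        for i in range(0, usable, 3)
    ]


def estimate(pattern_lists, pseudocount=1.0):
    """Estimate pattern probabilities from training data, with add-one
    style smoothing so that unseen patterns keep a non-zero probability."""
    counts = Counter(p for patterns in pattern_lists for p in patterns)
    total = sum(counts.values()) + pseudocount * len(ALL_PATTERNS)
    return {p: (counts[p] + pseudocount) / total for p in ALL_PATTERNS}


def log_odds(patterns, coding, background):
    """Sum of per-triplet log ratios; a positive score favors 'coding'."""
    return sum(math.log(coding[p] / background[p]) for p in patterns)


# Toy training data (illustrative only): the "coding" pair differs at
# third positions, the "background" pair differs at arbitrary positions.
coding_model = estimate([conservation_patterns("ATGGCCAAA", "ATGGCTAAG")])
background_model = estimate([conservation_patterns("ATGGCCAAA", "TTGGACAAA")])

query = conservation_patterns("GCCAAA", "GCTAAG")
score = log_odds(query, coding_model, background_model)
label = "coding" if score > 0 else "non-coding"
```

A putative exon would be kept when its score exceeds a chosen threshold (zero in this sketch); in practice the threshold would be tuned to trade specificity against sensitivity.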
The method was tested on pre-aligned human-mouse sequences in which the putative exons were predicted by GENSCAN and TWINSCAN. The proposed method improves exon specificity by 73% and 32%, respectively, while the loss of sensitivity is ≤ 1%. The method also keeps 98% of the RefSeq gene structures that are correctly predicted by TWINSCAN while removing 26% of the predicted genes that lie in non-coding regions. The estimated number of true exons in TWINSCAN's predictions is 157,070. The results and the executable code can be downloaded from http://www.stat.purdue.edu/~jingwu/codon/
The proposed method demonstrates an application of the evolutionary conservation principle to coding exons. It is a complementary method that can be used as an additional criterion to refine many existing gene predictions.