Research Program for Computational Science, Research and Development Group for Next-Generation Integrated Living Matter Simulation, Fusion of Data and Analysis Research and Development Team, RIKEN, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan.
BioData Min. 2010 Sep 28;3(1):6. doi: 10.1186/1756-0381-3-6.
Identifying protein-coding regions in genomic sequences is an essential step in genome analysis. It is well known that the proportion of false positives among genes predicted by current methods is high, especially when the exons are short. These false positives are problematic because they waste the time and resources of experimental studies.
We developed GeneWaltz, a new filtering method that reduces the risk of false positives in gene finding. GeneWaltz utilizes a codon-to-codon substitution matrix that was constructed by comparing protein-coding regions from orthologous gene pairs between the mouse and human genomes. Using this matrix, a scoring scheme was developed that assigns higher scores to coding regions and lower scores to non-coding regions. Regions with high scores were considered candidate coding regions. One-dimensional Karlin-Altschul statistics were used to test the significance of the coding regions identified by GeneWaltz.
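The scoring idea described above can be illustrated with a minimal Python sketch. It is not the GeneWaltz implementation: the toy matrix entries, the default penalty, the function names, and the values of lambda and K are all hypothetical, and the sketch assumes the two sequences are already aligned codon by codon without gaps. It scores aligned codon pairs, finds the maximal-scoring segment, and converts that score into a p-value using the standard Karlin-Altschul form E = K * N * exp(-lambda * S).

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy scores for aligned codon pairs. The real GeneWaltz matrix
# is distributed with the software and is not reproduced here.
TOY_MATRIX: Dict[Tuple[str, str], float] = {
    ("ATG", "ATG"): 3.0,  # identical codon
    ("GCT", "GCC"): 2.0,  # synonymous pair (Ala/Ala)
    ("AAA", "AAG"): 2.0,  # synonymous pair (Lys/Lys)
    ("GAT", "GAA"): 1.0,  # conservative pair (Asp/Glu)
}
DEFAULT_SCORE = -2.0  # assumed penalty for any pair not listed above


def codons(seq: str) -> List[str]:
    """Split a sequence into codons in frame 0, dropping a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def pair_scores(seq_a: str, seq_b: str) -> List[float]:
    """Score each aligned codon pair (gap-free alignment assumed)."""
    return [TOY_MATRIX.get(pair, DEFAULT_SCORE)
            for pair in zip(codons(seq_a), codons(seq_b))]


def best_segment(scores: List[float]) -> Tuple[float, int, int]:
    """Return (score, start, end) of the maximal-scoring segment, end exclusive."""
    best, best_start, best_end = 0.0, 0, 0
    running, start = 0.0, 0
    for i, s in enumerate(scores):
        if running <= 0:
            # A non-positive prefix cannot help; restart the segment here.
            running, start = 0.0, i
        running += s
        if running > best:
            best, best_start, best_end = running, start, i + 1
    return best, best_start, best_end


def karlin_altschul_pvalue(score: float, n: int, lam: float, k: float) -> float:
    """P(at least one segment scoring >= score) in a random sequence of length n."""
    expected = k * n * math.exp(-lam * score)
    return -math.expm1(-expected)  # equals 1 - exp(-expected)


# Usage with made-up sequences and placeholder parameters.
human = "TTTATGGCTAAAGATCCC"
mouse = "GGGATGGCCAAGGAATTT"
scores = pair_scores(human, mouse)
segment_score, seg_start, seg_end = best_segment(scores)
p_value = karlin_altschul_pvalue(segment_score, n=len(scores), lam=0.3, k=0.1)
```

In practice, lambda and K would have to be estimated from the score distribution of the actual matrix and background codon frequencies; the values used here are placeholders only.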
The proportion of false positives among genes predicted by GENSCAN and Twinscan was high, especially when the exons were short. GeneWaltz significantly reduced the ratio of false positives to all positives predicted by GENSCAN and Twinscan, especially when the exons were short.
GeneWaltz will be helpful in experimental genomic studies. GeneWaltz binaries and the matrix are available online at http://en.sourceforge.jp/projects/genewaltz/.