Wu Jing
Department of Statistics, Carnegie Mellon University, PA 15213, USA.
Adv Bioinformatics. 2010;2010:287070. doi: 10.1155/2010/287070. Epub 2010 Mar 8.
A procedure is proposed to test whether a genomic sequence contains coding DNA; such a sequence is called a coding potential region. The procedure tests the coding potential of short conserved genomic sequences while relaxing the assumptions on the probability models of gene structures. It is therefore expected to contribute additional candidate regions containing coding DNA to current genomic databases. The procedure was applied to the set of highly conserved human-mouse sequences in the genome database at the University of California at Santa Cruz. Among the sequences in this set that contain RefSeq coding exons, the procedure detected coding potential in 91.3% of the regions, covering 83% of the human RefSeq coding exons, at a 2.6% false positive rate. The procedure also detected 12,688 novel short regions with coding potential at a false discovery rate <0.05; 65.7% of these novel regions lie between annotated genes.