Moskal William A, Wu Hank C, Underwood Beverly A, Wang Wei, Town Christopher D, Xiao Yongli
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA.
BMC Genomics. 2007 Jan 17;8:18. doi: 10.1186/1471-2164-8-18.
Several lines of evidence support the existence of novel genes and other transcribed units that have not yet been annotated in the Arabidopsis genome. Two gene prediction programs that use comparative genomic analysis, Twinscan and EuGene, have recently been applied to the Arabidopsis genome. Because these programs can draw on sequence data from other species, both Twinscan and EuGene predict over 1,000 genes that are intergenic with respect to the most recent annotation release. A high-throughput RACE pipeline was used to verify the structure and expression of these novel genes.
RACE targeted 1,071 un-annotated loci, and full-length sequence coverage was obtained for 35% of the targeted genes. We verified the structure and expression of 378 genes that were absent from the most recent release of the Arabidopsis genome annotation. These 378 genes represent a structurally diverse set of transcripts and encode a functionally diverse set of proteins.
We investigated the accuracy of the Twinscan and EuGene gene prediction programs and found them to be reliable predictors of gene structure in Arabidopsis. This work validated several hundred previously un-annotated genes. Based on these results, it is likely that the Arabidopsis genome annotation still overlooks several hundred protein-coding genes.