Wei Chaochun, Brent Michael R
Laboratory for Computational Genomics and Department of Computer Science and Engineering, Washington University, One Brookings Drive, St. Louis, MO 63130, USA.
BMC Bioinformatics. 2006 Jul 3;7:327. doi: 10.1186/1471-2105-7-327.
Expressed sequence tags (ESTs) are a tremendous resource for determining the exon-intron structures of genes, but even extensive EST sequencing tends to leave many exons and genes untouched. Gene prediction systems based exclusively on EST alignments miss these exons and genes, leading to poor sensitivity. De novo gene prediction systems, which ignore ESTs in favor of genomic sequence, can predict such "untouched" exons, but they are less accurate when predicting exons to which ESTs align. TWINSCAN is the most accurate de novo gene finder available for nematodes and N-SCAN is the most accurate for mammals, as measured by exact CDS gene prediction and exact exon prediction.
TWINSCAN_EST is a new system that successfully combines EST alignments with TWINSCAN. On the whole C. elegans genome, TWINSCAN_EST shows a 14% improvement in sensitivity and a 13% improvement in specificity in predicting exact gene structures compared to TWINSCAN without EST alignments. Not only are the structures revealed by EST alignments predicted correctly, but these structures also constrain the predictions in regions without alignments, improving their accuracy. For the human genome, we used the same approach with N-SCAN, creating N-SCAN_EST. On the whole genome, N-SCAN_EST produced a 6% improvement in sensitivity and a 1% improvement in specificity of exact gene structure predictions compared to N-SCAN.
TWINSCAN_EST and N-SCAN_EST are more accurate than TWINSCAN and N-SCAN, while retaining their ability to discover novel genes to which no ESTs align. Thus, we recommend using the EST versions of these programs to annotate any genome for which EST information is available. TWINSCAN_EST and N-SCAN_EST are part of the TWINSCAN open source software package (http://genes.cse.wustl.edu/distribution/download_TS.html).
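The abstract reports accuracy as sensitivity and specificity of exact gene structure prediction but does not define them. The following is a minimal sketch, assuming the convention common in gene-prediction evaluation: sensitivity is the fraction of annotated genes predicted exactly, and specificity is the fraction of predicted genes that exactly match an annotation. The `gene` helper, the `exact_gene_accuracy` function, and the one-transcript-per-gene representation are hypothetical simplifications for illustration, not the evaluation code used by the authors.

```python
from typing import FrozenSet, Set, Tuple

# Assumption: a gene structure is (sequence name, strand, set of coding exons),
# where each exon is a (start, end) coordinate pair. One transcript per gene.
Exon = Tuple[int, int]
Gene = Tuple[str, str, FrozenSet[Exon]]


def gene(seq: str, strand: str, exons) -> Gene:
    """Build a hashable gene structure from a list of (start, end) exons."""
    return (seq, strand, frozenset(exons))


def exact_gene_accuracy(annotated: Set[Gene], predicted: Set[Gene]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for exact gene structure matches.

    A prediction counts as correct only if every coding exon boundary
    matches the annotation exactly.
    """
    correct = len(annotated & predicted)
    sensitivity = correct / len(annotated) if annotated else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data (invented for illustration only).
annotated = {
    gene("chrI", "+", [(100, 200), (300, 400)]),
    gene("chrI", "-", [(1000, 1200)]),
    gene("chrII", "+", [(50, 150), (250, 350), (450, 500)]),
    gene("chrII", "-", [(900, 950)]),
}
predicted = {
    gene("chrI", "+", [(100, 200), (300, 400)]),   # exact match
    gene("chrI", "-", [(1000, 1200)]),             # exact match
    gene("chrII", "+", [(50, 150), (250, 360)]),   # wrong boundary, missing exon
}

sn, sp = exact_gene_accuracy(annotated, predicted)
```

Under this definition a single wrong exon boundary makes the whole gene incorrect, which is why exact gene accuracy is a stricter measure than exon-level accuracy.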