Chuang Trees-Juen, Lin Wen-Chang, Lee Hurng-Chun, Wang Chi-Wei, Hsiao Keh-Lin, Wang Zi-Hao, Shieh Danny, Lin Simon C, Ch'ang Lan-Yang
Bioinformatics Research Center, Institute of Biomedical Sciences, Academia Sinica, Taipei 11529, Taiwan.
Genome Res. 2003 Feb;13(2):313-22. doi: 10.1101/gr.313703.
DNA is a universal language that encodes the biological instructions for life. In higher organisms, genetic information is stored predominantly in an organized exon/intron structure. When a gene is expressed, its exons are spliced together to form the transcript used for protein synthesis. We have developed a complexity reduction algorithm for sequence analysis (CRASA) that enables direct alignment of cDNA sequences to the genome. The method uses a progressive, hierarchically ordered data structure to support fast and efficient searching. The CRASA implementation was tested on previously annotated genomic sequences from two benchmark data sets and compared with 15 annotation programs (10 ab initio and 5 homology-based) against the EST database. Layered noise filters reduced the complexity of CRASA-matched data exponentially. In the benchmark tests, CRASA annotation excelled in both sensitivity and specificity. When CRASA was applied to human Chromosomes 21 and 22, an additional 83 potential genes were identified. With its large-scale processing capability, CRASA can serve as a robust, high-accuracy genome annotation tool by matching EST sequences precisely to genomic sequences.
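The abstract does not describe CRASA's data structure or its filters, so the following is only a minimal sketch of the general idea of matching a spliced transcript to a genome through successive filtering layers. It uses a generic k-mer seed index and diagonal filtering, which is a substitute technique and not the authors' method. All function names, the toy sequences, and the parameters `k` and `min_hits` are assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map each k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_hits(est, index, k):
    """Layer 1: collect (est_pos, genome_pos) pairs that share an exact k-mer."""
    hits = []
    for i in range(len(est) - k + 1):
        for g in index.get(est[i:i + k], ()):
            hits.append((i, g))
    return hits


def filter_by_diagonal(hits, min_hits):
    """Layer 2: keep only diagonals (genome_pos - est_pos) with enough seeds.

    Isolated chance matches fall on sparsely populated diagonals and are dropped.
    """
    by_diag = defaultdict(list)
    for i, g in hits:
        by_diag[g - i].append((i, g))
    return {d: sorted(h) for d, h in by_diag.items() if len(h) >= min_hits}


def exon_blocks(diagonals, k):
    """Layer 3: merge overlapping or adjacent seeds on a diagonal into blocks.

    Returns half-open tuples (est_start, est_end, genome_start, genome_end).
    """
    blocks = []
    for d, hits in diagonals.items():
        start, end = hits[0][0], hits[0][0] + k
        for i, _ in hits[1:]:
            if i <= end:
                end = max(end, i + k)
            else:
                blocks.append((start, end, start + d, end + d))
                start, end = i, i + k
        blocks.append((start, end, start + d, end + d))
    return sorted(blocks)


def align_est(est, genome, k=8, min_hits=3):
    """Run the three layers in order and return candidate exon blocks."""
    index = build_kmer_index(genome, k)
    return exon_blocks(filter_by_diagonal(seed_hits(est, index, k), min_hits), k)


# Toy data (hypothetical): two exons separated by one intron.
EXON1 = "ATGGCGTACGTTAGC"
INTRON = "GTAAGTCCCCCCTTTTTTCAG"
EXON2 = "CATTACAGGCTTAACG"
GENOME = EXON1 + INTRON + EXON2
EST = EXON1 + EXON2  # spliced transcript: the intron is absent

if __name__ == "__main__":
    for block in align_est(EST, GENOME):
        print(block)
```

On the toy data, the transcript maps to two genomic blocks separated by a gap the size of the intron, which is the exon/intron signal that EST-to-genome matching relies on. A real tool would also have to handle sequencing errors, repeats, reverse-complement matches, and splice-site refinement, none of which this sketch attempts.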