Yeh R F, Lim L P, Burge C B
Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.
Genome Res. 2001 May;11(5):803-16. doi: 10.1101/gr.175701.
With the human genome sequence approaching completion, a major challenge is to identify the locations and encoded protein sequences of all human genes. To address this problem, we have developed a new gene identification algorithm, GenomeScan, which combines exon-intron and splice signal models with similarity to known protein sequences in an integrated model. Extensive testing shows that GenomeScan can accurately identify the exon-intron structures of genes in finished or draft human genome sequence with a low rate of false positives. Application of GenomeScan to 2.7 billion bases of human genomic DNA identified at least 20,000-25,000 human genes out of an estimated 30,000-40,000 present in the genome. The results demonstrate an accurate and efficient automated approach for identifying genes in higher eukaryotic genomes and provide a first-level annotation of the draft human genome.