Liang F, Holt I, Pertea G, Karamycheva S, Salzberg S L, Quackenbush J
The Institute for Genomic Research, Rockville, Maryland, USA.
Nat Genet. 2000 Jun;25(2):239-40. doi: 10.1038/76126.
Although sequencing of the human genome will soon be completed, gene identification and annotation remain a challenge. Early estimates suggested that there might be 60,000-100,000 (ref. 1) human genes, but recent analyses of the available data from EST sequencing projects have estimated as few as 45,000 (ref. 2) or as many as 140,000 (ref. 3) distinct genes. The Chromosome 22 Sequencing Consortium estimated a minimum of 45,000 genes based on their annotation of the complete chromosome, although their data suggest there may be additional genes. The nearly 2,000,000 human ESTs in dbEST provide an important resource for gene identification and genome annotation, but these single-pass sequences must be carefully analysed to remove contaminating sequences, including those from genomic DNA, spurious transcription, and vector and bacterial sequences. We have developed a highly refined and rigorously tested protocol for cleaning, clustering and assembling EST sequences to produce high-fidelity consensus sequences for the represented genes (F.L. et al., manuscript submitted) and used this to create the TIGR Gene Indices, databases of expressed genes for human, mouse, rat and other species (http://www.tigr.org/tdb/tgi.html). Using these algorithms for EST analysis, we have arrived at two independent estimates indicating that the human genome contains approximately 120,000 genes.