Chen R, Bouck J B, Weinstock G M, Gibbs R A
Department of Molecular and Human Genetics, Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA.
Genome Res. 2001 Nov;11(11):1807-16. doi: 10.1101/gr.203601.
Multi-species sequence comparison is an efficient way to reveal conserved genes. Because sequence finishing is expensive and time-consuming, many genome sequences are likely to remain incomplete. A challenge is to use these fragmented data to understand the human genome. This paper describes methods for using cross-species whole-genome shotgun (WGS) sequence for genome annotation. About half a million high-quality rat WGS reads (covering 7.5% of the rat genome), generated at the Baylor College of Medicine Human Genome Sequencing Center, were compared with the human genome. Using computer-generated random reads as a negative control, a set of parameters was determined for reliable interpretation of BLAST search results. About 10% of the rat reads contain regions that are conserved in the human genomic sequence, and about one-third of these include known gene-coding regions. Mapping the conserved regions to human chromosomes showed a 23-fold enrichment for coding regions compared with noncoding regions. This approach can also be applied to other mammalian genomes for gene finding. These data predict approximately 42,500 genes in the human genome, slightly more than previously reported.
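The abstract does not state which BLAST parameters were chosen or how they were derived, so the following is only a minimal sketch of how a negative control of random reads could be used to calibrate a cutoff, and of one possible way to define fold enrichment. The function names, the choice of a bit-score cutoff, the target false-positive rate, and all numbers are hypothetical illustrations, not values or methods taken from the paper.

```python
from typing import Sequence


def pick_score_cutoff(control_scores: Sequence[float],
                      max_false_positive_rate: float) -> float:
    """Return the lowest cutoff (taken from the observed control scores)
    at which the fraction of control reads scoring strictly above it
    does not exceed max_false_positive_rate.

    Assumption: control_scores holds the best-hit score of each
    computer-generated random read against the target genome.
    """
    if not control_scores:
        raise ValueError("need at least one control score")
    n = len(control_scores)
    for cutoff in sorted(set(control_scores)):
        passing = sum(1 for s in control_scores if s > cutoff)
        if passing / n <= max_false_positive_rate:
            return cutoff
    # Unreachable in practice: at the maximum score nothing passes.
    return max(control_scores)


def conserved_fraction(read_scores: Sequence[float], cutoff: float) -> float:
    """Fraction of real reads whose best-hit score is strictly above cutoff."""
    if not read_scores:
        raise ValueError("need at least one read score")
    return sum(1 for s in read_scores if s > cutoff) / len(read_scores)


def fold_enrichment(hits_coding: int, bases_coding: int,
                    hits_noncoding: int, bases_noncoding: int) -> float:
    """Hit density in coding sequence divided by hit density in noncoding
    sequence. This density-ratio definition is an assumption; the abstract
    does not say how its enrichment figure was computed.
    """
    return (hits_coding * bases_noncoding) / (hits_noncoding * bases_coding)


if __name__ == "__main__":
    # Made-up toy scores, for illustration only.
    control = [10, 12, 15, 18, 20, 22, 25, 30, 35, 40]
    reads = [5, 31, 50, 8, 29, 45, 12, 33]
    cutoff = pick_score_cutoff(control, max_false_positive_rate=0.25)
    print("cutoff:", cutoff)
    print("conserved fraction:", conserved_fraction(reads, cutoff))
    print("fold enrichment:", fold_enrichment(30, 100, 20, 1000))
```

In a sketch like this, tightening `max_false_positive_rate` raises the cutoff and lowers the fraction of reads called conserved, which is the trade-off a negative control is meant to make explicit.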