Xuan Zhenyu, Wang Jinhua, Zhang Michael Q
Cold Spring Harbor Laboratory, New York, NY 11724, USA.
Genome Biol. 2003;4(1):R1. doi: 10.1186/gb-2002-4-1-r1. Epub 2002 Dec 5.
The availability of both mouse and human draft genomes has marked the beginning of a new era of comparative mammalian genomics. The two available mouse genome assemblies, from the public mouse genome sequencing consortium and Celera Genomics, were obtained using different clone libraries and different assembly methods.
We present here a critical comparison of the two latest mouse genome assemblies. The utility of the combined genomes is further demonstrated by comparing them with the human 'golden path' and through a subsequent analysis of a resulting conserved sequence element (CSE) database, which allows us to identify over 6,000 potential novel genes and to derive independent estimates of the number of human protein-coding genes.
The Celera and public mouse assemblies differ in about 10% of the mouse genome. Each assembly has advantages over the other: the Celera assembly has higher base-pair accuracy and higher overall coverage of the genome, whereas the public assembly has higher sequence quality in some newly finished bacterial artificial chromosome (BAC) clone regions, and its data are freely accessible. Perhaps most importantly, combining both assemblies gives a better annotation of the human genome; in particular, it gives the most complete set of CSEs, one third of which are related to known genes, while some others are related to other functional genomic regions. More than half of the CSEs are of unknown function. From the CSEs, we estimate the total number of human protein-coding genes to be about 40,000. The searchable, publicly available online database CSEdb will expedite new discoveries through comparative genomics.