Clamp Michele, Fry Ben, Kamal Mike, Xie Xiaohui, Cuff James, Lin Michael F, Kellis Manolis, Lindblad-Toh Kerstin, Lander Eric S
Broad Institute of Massachusetts Institute of Technology and Harvard, 7 Cambridge Center, Cambridge, MA 02142, USA.
Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33. doi: 10.1073/pnas.0709013104. Epub 2007 Nov 26.
Although the Human Genome Project was completed 4 years ago, the catalog of human protein-coding genes remains a matter of controversy. Current catalogs list a total of approximately 24,500 putative protein-coding genes. It is broadly suspected that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog. However, there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation: the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages. Here, we reject this hypothesis by carefully analyzing the nonconserved ORFs, specifically their properties in other primates. We show that the vast majority of these ORFs are random occurrences. The analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to approximately 20,500. Specifically, it suggests that nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein. It also provides a principled methodology for evaluating future proposed additions to the human gene catalog. Finally, the results indicate that there has been relatively little true innovation in mammalian protein-coding genes.