Distinguishing protein-coding and noncoding genes in the human genome.

Author information

Clamp Michele, Fry Ben, Kamal Mike, Xie Xiaohui, Cuff James, Lin Michael F, Kellis Manolis, Lindblad-Toh Kerstin, Lander Eric S

Affiliation

Broad Institute of Massachusetts Institute of Technology and Harvard, 7 Cambridge Center, Cambridge, MA 02142, USA.

Publication information

Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33. doi: 10.1073/pnas.0709013104. Epub 2007 Nov 26.

DOI: 10.1073/pnas.0709013104

PMID: 18040051

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2148306/

Abstract

Although the Human Genome Project was completed 4 years ago, the catalog of human protein-coding genes remains a matter of controversy. Current catalogs list a total of approximately 24,500 putative protein-coding genes. It is broadly suspected that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog. However, there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation: the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages. Here, we reject this hypothesis by carefully analyzing the nonconserved ORFs, specifically their properties in other primates. We show that the vast majority of these ORFs are random occurrences. The analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to approximately 20,500. Specifically, it suggests that nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein. It also provides a principled methodology for evaluating future proposed additions to the human gene catalog. Finally, the results indicate that there has been relatively little true innovation in mammalian protein-coding genes.
