Comparison of RefSeq protein-coding regions in human and vertebrate genomes.

Author information

National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Publication information

BMC Genomics. 2013 Sep 25;14:654. doi: 10.1186/1471-2164-14-654.

DOI:10.1186/1471-2164-14-654

PMID:24063302

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3882889/

Abstract

BACKGROUND

Advances in high-throughput sequencing technology have yielded a large number of publicly available vertebrate genomes, many of which are selected for inclusion in NCBI's RefSeq project and subsequently processed by NCBI's eukaryotic annotation pipeline. Genome annotation results are affected by differences in available support evidence and may be impacted by annotation pipeline software changes over time. The RefSeq project has not previously assessed annotation trends across organisms or over time. To address this deficiency, we have developed a comparative protocol which integrates analysis of annotated protein-coding regions across a data set of vertebrate orthologs in genomic sequence coordinates, protein sequences, and protein features.

RESULTS

We assessed an ortholog dataset comprising 34 annotated vertebrate RefSeq genomes, including human. We confirm that RefSeq protein-coding gene annotations in mammals exhibit considerable similarity. Over 50% of the orthologous protein-coding genes in 20 organisms are supported at the level of splicing conservation with at least three selected reference genomes. Approximately 7,500 ortholog sets include at least half of the analyzed organisms, show highly similar sequence and conserved splicing, and may serve as a minimal set of mammalian "core proteins" for initial assessment of new mammalian genomes. Additionally, 80% of the proteins analyzed pass a suite of tests designed to detect proteins that lack splicing conservation and have unusual sequence or domain annotation. We use these tests to define an annotation quality metric that is based directly on the annotated proteins and thus operates independently of other quality metrics such as transcript availability or assembly quality measures. Results are available on the RefSeq FTP site [http://ftp.ncbi.nlm.nih.gov/refseq/supplemental/ProtCore/SM1.txt].

CONCLUSIONS

Our multi-factored analysis demonstrates a high level of consistency in RefSeq protein representation among vertebrates. We find that the majority of the RefSeq vertebrate proteins for which we have calculated orthology are good as measured by these metrics. The process flow described provides specific information on the scope and degree of conservation for the analyzed protein sequences and annotations and will be used to enrich the quality of RefSeq records by identifying targets for further improvement in the computational annotation pipeline, and by flagging specific genes for manual curation.
