Comparison of RefSeq protein-coding regions in human and vertebrate genomes.

Affiliation

National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Publication information

BMC Genomics. 2013 Sep 25;14:654. doi: 10.1186/1471-2164-14-654.

Abstract

BACKGROUND

Advances in high-throughput sequencing technology have yielded a large number of publicly available vertebrate genomes, many of which are selected for inclusion in NCBI's RefSeq project and subsequently processed by NCBI's eukaryotic annotation pipeline. Genome annotation results are affected by differences in available support evidence and may be impacted by annotation pipeline software changes over time. The RefSeq project has not previously assessed annotation trends across organisms or over time. To address this deficiency, we have developed a comparative protocol which integrates analysis of annotated protein-coding regions across a data set of vertebrate orthologs in genomic sequence coordinates, protein sequences, and protein features.

RESULTS

We assessed an ortholog dataset comprising 34 annotated vertebrate RefSeq genomes, including human. We confirm that RefSeq protein-coding gene annotations in mammals exhibit considerable similarity. Over 50% of the orthologous protein-coding genes in 20 organisms are supported at the level of splicing conservation with at least three selected reference genomes. Approximately 7,500 ortholog sets include at least half of the analyzed organisms, show highly similar sequence and conserved splicing, and may serve as a minimal set of mammalian "core proteins" for initial assessment of new mammalian genomes. Additionally, 80% of the proteins analyzed pass a suite of tests designed to detect proteins that lack splicing conservation and have unusual sequence or domain annotation. We use these tests to define an annotation quality metric that is based directly on the annotated proteins and thus operates independently of other quality metrics, such as availability of transcripts or assembly quality measures. Results are available on the RefSeq FTP site [http://ftp.ncbi.nlm.nih.gov/refseq/supplemental/ProtCore/SM1.txt].
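The "core protein" selection described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the `OrthologSet` record, its field names, the `select_core_sets` function, and the 0.8 identity cutoff are all hypothetical, chosen only to show the three stated criteria (coverage of at least half of the analyzed organisms, high sequence similarity, conserved splicing) applied as a filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrthologSet:
    """Hypothetical record for one ortholog set; field names are illustrative."""
    gene: str
    organisms: frozenset       # organisms with an annotated ortholog in this set
    splicing_conserved: bool   # assumed to be precomputed elsewhere
    min_identity: float        # assumed lowest pairwise protein identity (0..1)


def select_core_sets(sets, n_organisms, identity_cutoff=0.8):
    """Return gene names of sets meeting all three criteria.

    identity_cutoff is an assumed value; the abstract does not state one.
    """
    core = []
    for s in sets:
        # "at least half of the analyzed organisms"
        covers_half = 2 * len(s.organisms) >= n_organisms
        if covers_half and s.splicing_conserved and s.min_identity >= identity_cutoff:
            core.append(s.gene)
    return core


# Toy data, invented for illustration only.
example = [
    OrthologSet("GENE_A", frozenset({"human", "mouse", "dog"}), True, 0.92),
    OrthologSet("GENE_B", frozenset({"human"}), True, 0.95),
    OrthologSet("GENE_C", frozenset({"human", "mouse", "dog", "cow"}), False, 0.90),
]
print(select_core_sets(example, n_organisms=4))
```

In this toy input only `GENE_A` passes: `GENE_B` covers too few organisms and `GENE_C` lacks conserved splicing.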

CONCLUSIONS

Our multi-factored analysis demonstrates a high level of consistency in RefSeq protein representation among vertebrates. We find that the majority of the RefSeq vertebrate proteins for which we have calculated orthology are of good quality as measured by these metrics. The process flow described provides specific information on the scope and degree of conservation for the analyzed protein sequences and annotations. It will be used to improve the quality of RefSeq records by identifying targets for further improvement in the computational annotation pipeline and by flagging specific genes for manual curation.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5129/3882889/efaa77e3f1d9/1471-2164-14-654-1.jpg
