RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892-6511, USA.

Publication information

Nucleic Acids Res. 2021 Jan 8;49(D1):D1020-D1028. doi: 10.1093/nar/gkaa1105.

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Similar articles

RefSeq: an update on mammalian reference sequences.
Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. doi: 10.1093/nar/gkt1114. Epub 2013 Nov 19.

Update on RefSeq microbial genomes resources.
Nucleic Acids Res. 2015 Jan;43(Database issue):D599-605. doi: 10.1093/nar/gku1062. Epub 2014 Dec 15.

The UCSC Genome Browser database: 2021 update.
Nucleic Acids Res. 2021 Jan 8;49(D1):D1046-D1057. doi: 10.1093/nar/gkaa1070.

