RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes.

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Publication information

Nucleic Acids Res. 2024 Jan 5;52(D1):D762-D769. doi: 10.1093/nar/gkad988.

DOI: 10.1093/nar/gkad988

PMID: 37962425

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10767926/

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best-quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase over the past 3 years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5516/10767926/054e38a29f3f/gkad988figgra1.jpg
