RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes.

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Publication information

Nucleic Acids Res. 2024 Jan 5;52(D1):D762-D769. doi: 10.1093/nar/gkad988.

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past three years, we have expanded the diversity of the RefSeq collection by including the best-quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase over the past three years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5516/10767926/054e38a29f3f/gkad988figgra1.jpg
