O'Leary Nuala A, Wright Mathew W, Brister J Rodney, Ciufo Stacy, Haddad Diana, McVeigh Rich, Rajput Bhanu, Robbertse Barbara, Smith-White Brian, Ako-Adjei Danso, Astashyn Alexander, Badretdin Azat, Bao Yiming, Blinkova Olga, Brover Vyacheslav, Chetvernin Vyacheslav, Choi Jinna, Cox Eric, Ermolaeva Olga, Farrell Catherine M, Goldfarb Tamara, Gupta Tripti, Haft Daniel, Hatcher Eneida, Hlavina Wratko, Joardar Vinita S, Kodali Vamsi K, Li Wenjun, Maglott Donna, Masterson Patrick, McGarvey Kelly M, Murphy Michael R, O'Neill Kathleen, Pujar Shashikant, Rangwala Sanjida H, Rausch Daniel, Riddick Lillian D, Schoch Conrad, Shkeda Andrei, Storz Susan S, Sun Hanzhen, Thibaud-Nissen Francoise, Tolstoy Igor, Tully Raymond E, Vatsan Anjana R, Wallin Craig, Webb David, Wu Wendy, Landrum Melissa J, Kimchi Avi, Tatusova Tatiana, DiCuccio Michael, Kitts Paul, Murphy Terence D, Pruitt Kim D
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Nucleic Acids Res. 2016 Jan 4;44(D1):D733-45. doi: 10.1093/nar/gkv1189. Epub 2015 Nov 8.
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The project draws on data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) and applies a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. It augments these reference sequences with current knowledge, including publications, functional features, and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes, and >10,000 eukaryotes; RefSeq release 71), with records ranging from a single sequence to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access, and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data, including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to using available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.