Matching curated genome databases: a non trivial task.

Authors

Descorps-Declère Stéphane, Barba Matthieu, Labedan Bernard

Affiliation

Institut de Génétique et Microbiologie, Université Paris Sud XI, CNRS UMR 8621, Bât. 400, 91405 Orsay Cedex, France.

Publication

BMC Genomics. 2008 Oct 24;9:501. doi: 10.1186/1471-2164-9-501.

DOI:10.1186/1471-2164-9-501

PMID:18950477

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2596144/

Abstract

BACKGROUND

Curated databases of completely sequenced genomes have been designed independently at the NCBI (RefSeq) and the EBI (Genome Reviews) to cope with the non-standard annotation found in the versions of sequenced genomes published by the GenBank/EMBL/DDBJ databanks. These curation efforts were expected to review the annotations and to improve their pertinence when they are used to annotate newly released genome sequences by homology to previously annotated genomes. However, we observed that such an uncoordinated effort has two unwanted consequences. First, it is not trivial to map the protein identifiers of the same sequence between the two databases. Second, the two reannotated versions of the same genome differ at the level of their structural annotation.

RESULTS

Here, we propose CorBank, a program devised to cross-reference protein identifiers whatever the level of identity between their matching sequences. Approximately 98% of the 1,983,258 amino acid sequences match, allowing instantaneous retrieval of their respective cross-references. CorBank also detects any differences between the independently curated versions of the same genome. We found that the RefSeq and Genome Reviews versions match perfectly for only 50 of the 641 complete genomes we analyzed. In all other cases there are differences at the level of the coding sequence (CDS), and/or in the total number of CDS in the respective versions of the same genome. CorBank is freely accessible at http://www.corbank.u-psud.fr. The CorBank site also publishes updated, exhaustive results of the comparison between the RefSeq and Genome Reviews versions of each genome. Accordingly, the web site allows easy searching of cross-references between RefSeq, Genome Reviews, and UniProt, for either a single CDS or a whole replicon.
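The simplest case of the cross-referencing described above, where two entries carry exactly the same amino acid sequence, can be illustrated with a minimal sketch. This is not CorBank's implementation, which according to the abstract also pairs sequences at any level of identity. The function name `cross_reference`, the identifiers, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cross_reference(refseq, genome_reviews):
    """Pair protein identifiers whose amino acid sequences are identical.

    Both arguments map a protein identifier to its sequence.
    Returns (pairs, only_in_refseq, only_in_genome_reviews).
    """
    # Index the second database by sequence so each lookup is a dict access.
    by_sequence = defaultdict(list)
    for gr_id, seq in genome_reviews.items():
        by_sequence[seq.upper()].append(gr_id)

    pairs = []
    only_in_refseq = []
    matched_gr = set()
    for rs_id, seq in refseq.items():
        hits = by_sequence.get(seq.upper(), [])
        if hits:
            for gr_id in hits:
                pairs.append((rs_id, gr_id))
                matched_gr.add(gr_id)
        else:
            # No identical sequence in the other database.
            only_in_refseq.append(rs_id)

    only_in_gr = [g for g in genome_reviews if g not in matched_gr]
    return pairs, only_in_refseq, only_in_gr


# Toy data: identifiers and sequences are made up for illustration.
refseq = {"RS_0001": "MKTAYIAKQR", "RS_0002": "MSLNDPVQ", "RS_0003": "MAAGT"}
genome_reviews = {"GR_A": "MKTAYIAKQR", "GR_B": "MTSLNDPVQ"}

pairs, only_rs, only_gr = cross_reference(refseq, genome_reviews)
print("cross-references:", pairs)
print("unmatched in RefSeq:", only_rs)
print("unmatched in Genome Reviews:", only_gr)
print("CDS counts differ:", len(refseq) != len(genome_reviews))
```

In this toy data, `RS_0002` and `GR_B` differ by one residue near the start, the kind of CDS-level difference the abstract reports. An exact-match index leaves both unpaired, which is why a tool that cross-references at any level of identity needs an additional similarity step that this sketch does not attempt.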

CONCLUSION

CorBank rapidly detects the numerous differences between the RefSeq and Genome Reviews versions of the same curated genome. Although such differences are acceptable as reflecting different views, we suggest that the curators of both genome databases could help reduce further divergence by agreeing on a minimal dialogue and by publishing the point of view of the other database whenever it is technically possible.
