
Matching curated genome databases: a non-trivial task.

Author information

Descorps-Declère Stéphane, Barba Matthieu, Labedan Bernard

Affiliations

Institut de Génétique et Microbiologie, Université Paris Sud XI, CNRS UMR 8621, Bât. 400, 91405 Orsay Cedex, France.

Publication information

BMC Genomics. 2008 Oct 24;9:501. doi: 10.1186/1471-2164-9-501.

Abstract

BACKGROUND

Curated databases of completely sequenced genomes were designed independently at the NCBI (RefSeq) and the EBI (Genome Reviews) to cope with the non-standard annotation found in the versions of sequenced genomes published by the GenBank/EMBL/DDBJ databanks. These curation efforts were expected to review the annotations and to improve their pertinence when they are used to annotate newly released genome sequences by homology to previously annotated genomes. However, we observed that this uncoordinated effort has two unwanted consequences. First, it is not trivial to map the protein identifiers of the same sequence between the two databases. Second, the two reannotated versions of the same genome differ at the level of their structural annotation.

RESULTS

Here, we propose CorBank, a program devised to cross-reference protein identifiers whatever the level of identity found between their matching sequences. Approximately 98% of the 1,983,258 amino acid sequences match, allowing instantaneous retrieval of their respective cross-references. CorBank also detects any differences between the independently curated versions of the same genome. We found that the RefSeq and Genome Reviews versions match perfectly for only 50 of the 641 complete genomes we analyzed. In all other cases, differences occur at the level of the coding sequence (CDS) and/or in the total number of CDS in the respective versions of the same genome. CorBank is freely accessible at http://www.corbank.u-psud.fr. The CorBank site also publishes updated, exhaustive results of the comparison between the RefSeq and Genome Reviews versions of each genome. Accordingly, the web site allows easy searching of cross-references between RefSeq, Genome Reviews, and UniProt, for either a single CDS or a whole replicon.
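To make the cross-referencing idea concrete, the following is a minimal sketch, not CorBank's actual algorithm. It assumes the simplest possible case, in which two identifiers are paired only when their amino acid sequences are strictly identical, whereas the abstract states that CorBank matches sequences whatever their level of identity. The function names and the protein identifiers in the example are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def index_by_sequence(records):
    """Group protein identifiers by their exact amino acid sequence."""
    index = defaultdict(list)
    for protein_id, sequence in records:
        index[sequence.upper()].append(protein_id)
    return index


def cross_reference(refseq_records, genome_reviews_records):
    """Pair identifiers whose sequences are identical in both versions.

    Assumption: exact sequence identity only (a simplification).
    Returns (pairs, only_in_refseq, only_in_genome_reviews).
    """
    refseq_index = index_by_sequence(refseq_records)
    gr_index = index_by_sequence(genome_reviews_records)

    # Identifiers sharing the same sequence in both versions are cross-referenced.
    pairs = sorted(
        (refseq_id, gr_id)
        for sequence, refseq_ids in refseq_index.items()
        for refseq_id in refseq_ids
        for gr_id in gr_index.get(sequence, [])
    )
    # Identifiers left unmatched point to differences between the two versions.
    only_in_refseq = sorted(
        pid for seq, ids in refseq_index.items() if seq not in gr_index for pid in ids
    )
    only_in_gr = sorted(
        pid for seq, ids in gr_index.items() if seq not in refseq_index for pid in ids
    )
    return pairs, only_in_refseq, only_in_gr


# Hypothetical identifiers and toy sequences, for illustration only.
refseq = [("NP_000001", "MKTAYIAK"), ("NP_000002", "MSLLNQ"), ("NP_000003", "MVVVKT")]
genome_reviews = [("P00001", "MKTAYIAK"), ("P00002", "MSLLNQA")]

pairs, only_refseq, only_gr = cross_reference(refseq, genome_reviews)
cds_count_differs = len(refseq) != len(genome_reviews)
```

In this toy example, one pair is cross-referenced, the remaining identifiers are reported as unmatched, and the two versions differ in their total number of CDS. A tool in the spirit of the abstract would additionally need to pair sequences that differ, for instance because of divergent start codon choices, which this sketch does not attempt.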

CONCLUSION

CorBank is very efficient at rapidly detecting the numerous differences between the RefSeq and Genome Reviews versions of the same curated genome. Although such differences are acceptable as reflecting different views, we suggest that the curators of both genome databases could help reduce further divergence by agreeing on a minimal dialogue and by attempting to publish the point of view of the other database whenever it is technically possible.

