Next generation models for storage and representation of microbial biological annotation.

Affiliation

Biosciences Division, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831-6420, USA.

Publication information

BMC Bioinformatics. 2010 Oct 7;11 Suppl 6(Suppl 6):S15. doi: 10.1186/1471-2105-11-S6-S15.

DOI:10.1186/1471-2105-11-S6-S15

PMID:20946598

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3026362/

Abstract

BACKGROUND

Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way.

RESULTS

Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files.
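To make the TURTLE idea concrete, the following is a minimal sketch of how one gene feature, of the kind normally written to a GenBank/EMBL feature table, might be serialized as Turtle. It is not the authors' framework: the `ex:` namespace, the predicate names (`ex:start`, `ex:end`, `ex:strand`), the `ex:Gene` class, and the sample record are all hypothetical and chosen only for illustration. A real pipeline would draw its terms from published ontologies.

```python
# Hypothetical prefixes; only rdfs: is a real, standard vocabulary.
PREFIXES = {
    "ex": "http://example.org/annotation/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def literal(value):
    """Render a Python value as a Turtle literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)  # bare integers are Turtle shorthand for xsd:integer
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def gene_to_turtle(gene):
    """Serialize one gene record (a dict) as a small Turtle document."""
    lines = [f"@prefix {p}: <{iri}> ." for p, iri in PREFIXES.items()]
    lines.append("")
    subject = f"ex:{gene['locus_tag']}"
    # Predicate names below are illustrative assumptions, not a standard.
    pairs = [
        ("a", "ex:Gene"),
        ("rdfs:label", literal(gene["product"])),
        ("ex:start", literal(gene["start"])),
        ("ex:end", literal(gene["end"])),
        ("ex:strand", literal(gene["strand"])),
    ]
    body = " ;\n".join(f"    {pred} {obj}" for pred, obj in pairs)
    lines.append(f"{subject}\n{body} .")
    return "\n".join(lines) + "\n"


# Made-up sample record, for illustration only.
record = {
    "locus_tag": "GENE_0001",
    "product": "hypothetical protein",
    "start": 190,
    "end": 255,
    "strand": "+",
}
turtle_text = gene_to_turtle(record)
print(turtle_text)
```

Each statement in the output is a subject, predicate, object triple, so the same text that a person can read line by line can also be loaded by any RDF toolkit and joined with triples from other sources.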

CONCLUSIONS

The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes constructing genome annotation and ultimately be able to help improve the systems that produce them.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a23d/3026362/80bf627f7d4b/1471-2105-11-S6-S15-1.jpg

