Biosciences Division, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831-6420, USA.
BMC Bioinformatics. 2010 Oct 7;11 Suppl 6(Suppl 6):S15. doi: 10.1186/1471-2105-11-S6-S15.
Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high-quality annotation submissions to GenBank/EMBL, supported by expert manual curation. The exponential growth of sequence data drives a growing need for higher-quality, automatically generated annotation. Typical annotation pipelines use traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third-party software (e.g., relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and are therefore not scientifically tractable. The advent of Semantic Web standards such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new, comprehensive way.
Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third-party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human-readable yet semantically aware equivalent of GenBank/EMBL files.
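To make the idea of a TURTLE equivalent of a GenBank/EMBL feature entry concrete, the following is a minimal sketch in Python that serializes one annotated feature as a TURTLE subject block. It is an illustration under stated assumptions, not the framework described here: the `ex:` namespace, the class name `Gene`, the property names (`start`, `end`, `strand`, `product`), and the coordinate values are all hypothetical placeholders rather than terms from any published ontology.

```python
# Hypothetical namespace; the ontology IRIs used by the actual framework
# are not given in this abstract.
EX = "http://example.org/annotation#"


def turtle_literal(value):
    """Render a Python value as a Turtle literal (boolean, integer, or string)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    # Escape backslashes, double quotes, and newlines for a quoted string literal.
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def feature_to_turtle(feature_id, rdf_type, properties):
    """Serialize one feature as a Turtle subject block with ';'-separated predicates."""
    lines = [f"ex:{feature_id} a ex:{rdf_type}"]
    for name, value in properties.items():
        lines.append(f"    ex:{name} {turtle_literal(value)}")
    return " ;\n".join(lines) + " ."


def to_turtle_document(features):
    """Build a complete Turtle document: prefix declaration plus one block per feature."""
    header = f"@prefix ex: <{EX}> .\n\n"
    body = "\n\n".join(feature_to_turtle(*feature) for feature in features)
    return header + body + "\n"


# Illustrative values only; they do not come from a real genome record.
doc = to_turtle_document([
    ("gene_0001", "Gene", {
        "start": 190,
        "end": 255,
        "strand": "+",
        "product": "hypothetical protein",
    }),
])
print(doc)
```

The printed document declares the `ex:` prefix and then lists the feature as a single subject with one predicate per line. The content is comparable to a feature-table entry in a flat file, but each statement is an RDF triple that a standard parser can load and that can, in principle, be linked to ontology terms.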
The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes that construct genome annotation and will ultimately be able to help improve the systems that produce it.