Linking experimental results, biological networks and sequence analysis methods using Ontologies and Generalised Data Structures.

Authors

Koehler Jacob, Rawlings Chris, Verrier Paul, Mitchell Rowan, Skusa Andre, Ruegg Alexander, Philippi Stephan

Affiliation

Rothamsted Research, BAB division, Harpenden, UK.

Publication

In Silico Biol. 2005;5(1):33-44.

PMID:15972003

Abstract

We describe the structure of a closely integrated data warehouse designed to link different types and varying numbers of biological networks, sequence analysis methods and experimental results such as those from microarrays. The data schema combines graph-based methods with generalised data structures and makes use of ontologies and metadata. The core idea is to treat and store biological networks as graphs, and to use generalised data structures (GDS) to store further relevant information. This is possible because many biological networks can be stored as graphs: protein interactions, signal transduction networks, metabolic pathways, gene regulatory networks, etc. Nodes in biological graphs represent entities such as promoters, proteins, genes and transcripts, whereas the edges specify how the nodes are related. The semantics of the nodes and edges are defined using ontologies of node and relation types. Besides generic attributes that most biological entities possess (name, attribute description), further information is stored using generalised data structures. By linking directly and systematically to the underlying sequences (exons, introns, promoters, amino acid sequences), close interoperability with sequence analysis methods can be achieved. This approach allows us to store, query and update a wide variety of biological information in a semantically compact way, without requiring changes at the database schema level when new kinds of biological information are added. We describe how this data warehouse is being implemented by extending the text-mining framework ONDEX to link, support and complement different bioinformatics applications and research activities such as microarray analysis, sequence analysis and modelling/simulation of biological systems. The system is developed under the GPL license and can be downloaded from http://sourceforge.net/projects/ondex/
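
The data model outlined in the abstract can be illustrated with a minimal Python sketch. It is not the ONDEX schema or API: the class names (`Node`, `Graph`), the two small type ontologies, and all identifiers and values are assumptions made up for illustration. The sketch shows nodes and edges typed against ontologies, generic attributes held as fixed fields, and an open-ended GDS dictionary that accepts new kinds of information (here a sequence and a microarray value) without any change to the class definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

# Hypothetical, deliberately tiny ontologies of node and relation types.
NODE_TYPES = {"gene", "transcript", "protein", "promoter"}
RELATION_TYPES = {"encodes", "regulates", "interacts_with"}


@dataclass
class Node:
    """A biological entity: generic attributes plus an open-ended GDS."""
    node_id: str
    node_type: str  # must be a term from NODE_TYPES
    name: str
    description: str = ""
    # Generalised data structure: attribute name -> arbitrary value.
    gds: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.node_type}")


@dataclass
class Graph:
    """A biological network stored as typed nodes and typed, directed edges."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, relation: str, target: str) -> None:
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source, relation, target))

    def neighbours(self, node_id: str, relation: str) -> List[str]:
        """Return the targets reachable from node_id via the given relation."""
        return [t for s, r, t in self.edges if s == node_id and r == relation]


def build_example() -> Graph:
    graph = Graph()
    graph.add_node(Node("g1", "gene", "geneA"))
    graph.add_node(Node("p1", "protein", "proteinA"))
    graph.add_edge("g1", "encodes", "p1")
    # New kinds of information go into the GDS; no schema change is needed.
    graph.nodes["p1"].gds["amino_acid_sequence"] = "MKT"  # made-up sequence
    graph.nodes["g1"].gds["microarray_log_ratio"] = 1.8   # made-up value
    return graph


graph = build_example()
print(graph.neighbours("g1", "encodes"))  # prints ['p1']
```

Keeping the type vocabularies separate from the storage classes mirrors the role the abstract gives to ontologies: adding a new node or relation type here means extending a set of terms, not altering the structure that holds the data.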
