Biocomputing Platforms Ltd, Innopoli 2, Tekniikantie 14, FI-02150 Espoo, Finland.
BMC Bioinformatics. 2012 Jun 6;13:119. doi: 10.1186/1471-2105-13-119.
Biological databases contain large amounts of data concerning the functions and associations of genes and proteins. Integration of data from several such databases into a single repository can aid the discovery of previously unknown connections spanning multiple types of relationships and databases.
Biomine is a system that integrates cross-references from several biological databases into a graph model with multiple types of edges, such as protein interactions, gene-disease associations and gene ontology annotations. Edges are weighted based on their type, reliability, and informativeness. We present Biomine and evaluate its performance in link prediction, where the goal is to predict pairs of nodes that will be connected in the future, based on current data. In particular, we formulate protein interaction prediction and disease gene prioritization tasks as instances of link prediction. The predictions are based on a proximity measure computed on the integrated graph. We consider and experiment with several such measures, and perform a parameter optimization procedure where different edge types are weighted to optimize link prediction accuracy. We also propose a novel method for disease-gene prioritization, defined as finding a subset of candidate genes that cluster together in the graph. We experimentally evaluate Biomine by predicting future annotations in the source databases and prioritizing lists of putative disease genes.
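As a rough illustration of proximity-based link prediction on a weighted graph, the sketch below scores a node pair by the most probable path between them, taking the product of edge weights along a path as its probability. This particular measure, the toy graph, the node names, and the functions `best_path_proximity` and `rank_candidates` are all assumptions made for the example; the abstract does not specify which measures Biomine uses or how its weights are stored.

```python
import heapq
import math


def best_path_proximity(graph, source, target):
    """Return the largest product of edge weights over any path from source to target.

    `graph` maps each node to a dict {neighbor: weight}, with weights in (0, 1].
    Maximizing a product of weights equals minimizing the sum of -log(weight),
    so Dijkstra's algorithm applies. Returns 0.0 if no path exists.
    """
    best_cost = {source: 0.0}  # node -> smallest -log(probability) found so far
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return math.exp(-cost)
        if cost > best_cost.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            if weight <= 0.0:
                continue  # an edge with zero weight can never be used
            new_cost = cost - math.log(weight)
            if new_cost < best_cost.get(neighbor, math.inf):
                best_cost[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return 0.0


def rank_candidates(graph, query, candidates):
    """Sort candidate nodes by decreasing proximity to the query node."""
    scored = [(best_path_proximity(graph, query, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy graph (undirected, so each edge is listed in both directions).
toy_graph = {
    "gene:A": {"protein:P": 0.9, "go:T": 0.5},
    "protein:P": {"gene:A": 0.9, "disease:D": 0.8},
    "go:T": {"gene:A": 0.5, "disease:D": 0.9},
    "disease:D": {"protein:P": 0.8, "go:T": 0.9},
    "gene:B": {},  # isolated node, unreachable from the disease
}

ranking = rank_candidates(toy_graph, "disease:D", ["gene:A", "gene:B"])
for score, gene in ranking:
    print(gene, round(score, 3))
```

In this toy setting, a candidate gene with a strong path to the disease node ranks above an unconnected one. Re-weighting edge types, as the parameter optimization described above does, would change which path is the most probable and therefore the ranking.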
The experimental results show that Biomine has strong potential for predicting links when a set of selected candidate links is available. The predictions obtained using the entire Biomine dataset are shown to clearly outperform ones obtained using any single source of data alone, when different types of links are suitably weighted. In the gene prioritization task, an established reference set of disease-associated genes is useful, but the results show that under favorable conditions, Biomine can also perform well when no such information is available.
The Biomine system is a proof of concept. Its current version contains 1.1 million entities and 8.1 million relations between them, with focus on human genetics. Some of its functionalities are available in a public query interface at http://biomine.cs.helsinki.fi, allowing searching for and visualizing connections between given biological entities.