Computer and Information Sciences and Engineering Department, University of Florida, Gainesville, FL 32611, USA.
Bioinformatics. 2011 Jul 1;27(13):i149-58. doi: 10.1093/bioinformatics/btr203.
We consider the problem of similarity queries in biological network databases. Given a database of networks, similarity query returns all the database networks whose similarity (i.e. alignment score) to a given query network is at least a specified similarity cutoff value. Alignment of two networks is a very costly operation, which makes exhaustive comparison of all the database networks with a query impractical. To tackle this problem, we develop a novel indexing method, named RINQ (Reference-based Indexing for Biological Network Queries). Our method uses a set of reference networks to eliminate a large portion of the database quickly for each query. A reference network is a small biological network. We precompute and store the alignments of all the references with all the database networks. When our database is queried, we align the query network with all the reference networks. Using these alignments, we calculate a lower bound and an approximate upper bound to the alignment score of each database network with the query network. With the help of upper and lower bounds, we eliminate the majority of the database networks without aligning them to the query network. We also quickly identify a small portion of these as guaranteed to be similar to the query. We perform pairwise alignment only for the remaining networks. We also propose a supervised method to pick references that have a large chance of filtering the unpromising database networks. Extensive experimental evaluation suggests that (i) our method reduced the running time of a single query on a database of around 300 networks from over 2 days to only 8 h; (ii) our method outperformed the state of the art method Closure Tree and SAGA by a factor of three or more; and (iii) our method successfully identified statistically and biologically significant relationships across networks and organisms.
我们研究了生物网络数据库中的相似性查询问题。给定一个网络数据库,相似性查询返回所有与给定查询网络相似度(即对齐分数)至少达到指定相似度截止值的数据库网络。两个网络的对齐是一项非常昂贵的操作,使得对所有数据库网络与查询进行详尽的比较是不切实际的。为了解决这个问题,我们开发了一种新的索引方法,名为 RINQ(基于参考的生物网络查询索引)。我们的方法使用一组参考网络来快速排除每个查询的大部分数据库。参考网络是一个小型生物网络。我们预先计算并存储所有参考网络与所有数据库网络的对齐。当我们的数据库被查询时,我们将查询网络与所有参考网络对齐。使用这些对齐,我们计算每个数据库网络与查询网络的对齐分数的下界和近似上界。在上下界的帮助下,我们无需将它们与查询网络对齐就排除了大多数数据库网络。我们还快速确定其中一小部分网络与查询网络保证相似。我们仅对其余网络执行两两对齐。我们还提出了一种有监督的方法来选择参考网络,这些参考网络有很大的机会过滤掉不太可能的数据库网络。广泛的实验评估表明:(i)我们的方法将单个查询在大约 300 个网络的数据库上的运行时间从超过 2 天缩短到仅 8 小时;(ii)我们的方法比 Closure Tree 和 SAGA 等最先进的方法性能提高了三倍或更多;(iii)我们的方法成功地识别了网络和生物之间具有统计学和生物学意义的关系。