Maxwell Sean, Chance Mark R, Koyutürk Mehmet
Center for Proteomics and Bioinformatics.
Department of Nutrition.
Bioinformatics. 2017 May 1;33(9):1354-1361. doi: 10.1093/bioinformatics/btw733.
In recent years, various network proximity measures have been proposed to facilitate the use of biomolecular interaction data in a broad range of applications. These applications include functional annotation, disease gene prioritization, comparative analysis of biological systems and prediction of new interactions. In such applications, a major task is the scoring or ranking of the nodes in the network in terms of their proximity to a given set of 'seed' nodes (e.g. a group of proteins that are identified to be associated with a disease, or are differentially expressed in a certain condition). Many different network proximity measures are utilized for this purpose, and these measures are quite diverse in terms of the benefits they offer.
We propose a unifying framework for characterizing network proximity measures for set-based queries. We observe that many existing measures are linear, in that the proximity of a node to a set of nodes can be represented as an aggregation of its proximity to the individual nodes in the set. Based on this observation, we propose methods for processing set-based proximity queries that take advantage of sparse local proximity information. In addition, we provide an analytical framework for characterizing the distribution of proximity scores based on reference models that accurately capture the characteristics of the seed set (e.g. degree distribution and biological function). The resulting framework facilitates computation of exact figures for the statistical significance of network proximity scores, enabling assessment of the accuracy of Monte Carlo simulation-based estimation methods.
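The linearity property described above can be illustrated with a minimal sketch. It assumes random walk with restart as the proximity measure, a dense toy adjacency matrix, and a uniform reference model for seed sets; none of these choices is taken from the paper's implementation, and the function names (`rwr_matrix`, `set_proximity`, `rwr_direct`, `uniform_null_moments`) are hypothetical. The sketch shows that the set-based score equals the average of per-seed scores, and that under the uniform model the mean and variance of each node's score follow in closed form from standard finite-population sampling, without simulation.

```python
from itertools import combinations

import numpy as np


def rwr_matrix(adj, alpha=0.3):
    """Column s holds the random-walk-with-restart proximity of every node to seed s."""
    adj = np.asarray(adj, dtype=float)
    W = adj / adj.sum(axis=0)  # column-stochastic; assumes no isolated nodes
    n = adj.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * W)


def set_proximity(P, seeds):
    """Linearity: proximity to a seed set is the average of the per-seed columns."""
    return P[:, list(seeds)].mean(axis=1)


def rwr_direct(adj, seeds, alpha=0.3):
    """Reference computation: solve the restart equation for the whole set at once."""
    adj = np.asarray(adj, dtype=float)
    W = adj / adj.sum(axis=0)
    n = adj.shape[0]
    r = np.zeros(n)
    r[list(seeds)] = 1.0 / len(seeds)  # uniform restart over the seed set
    return np.linalg.solve(np.eye(n) - (1 - alpha) * W, alpha * r)


def uniform_null_moments(P, k):
    """Exact mean and variance of each node's score over all size-k seed sets
    drawn uniformly without replacement (an assumed, simplified reference model)."""
    n = P.shape[1]
    mean = P.mean(axis=1)
    var = P.var(axis=1) * (n - k) / (k * (n - 1))  # finite-population correction
    return mean, var


# Toy undirected network with 5 nodes (illustrative data, not from the paper).
ADJ = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
P = rwr_matrix(ADJ)
seeds = [0, 3]
scores = set_proximity(P, seeds)

# Brute-force check of the closed-form moments by enumerating every seed pair.
all_scores = np.array([set_proximity(P, s) for s in combinations(range(5), 2)])
null_mean, null_var = uniform_null_moments(P, k=2)
```

Because each column of `P` depends only on a single seed, it can be computed once and reused for any seed set. The reference models in the abstract also account for seed-set characteristics such as degree distribution and biological function, which this uniform sketch does not attempt to capture.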
Implementations of the methods in this paper are available at https://bioengine.case.edu/crosstalker, which also includes a robust visualization interface for viewing results.
stm@case.edu or mxk331@case.edu.
Supplementary data are available at Bioinformatics online.