School of Computer Engineering, Nanyang Technological University, Singapore.
Proteins. 2013 Nov;81(11):2023-33. doi: 10.1002/prot.24365. Epub 2013 Aug 23.
As more diverse biological information about proteins becomes available, integrating heterogeneous data becomes increasingly useful for many problems in proteomics, such as annotating protein functions and predicting novel protein-protein interactions. In this paper, we present an integrative approach called InteHC (Integrative Hierarchical Clustering) to identify protein complexes from multiple data sources. Although integrating multiple sources can improve the coverage of the currently incomplete protein interactome (the false-negative issue), it can also introduce false-positive interactions that hurt the performance of protein complex prediction. Our proposed InteHC method can effectively address these issues to facilitate accurate protein complex prediction, and it consists of three steps. First, for each individual source/feature, InteHC computes a matrix of affinity scores, where the score for a protein pair indicates its propensity to interact or to belong to the same complex. Second, InteHC computes a final score matrix as the weighted sum of the affinity scores from the individual sources; the weights, which indicate the reliability of each source, are learned by a supervised model (a linear ranking SVM). Finally, a hierarchical clustering algorithm is applied to the final score matrix to generate clusters, which are reported as predicted protein complexes. In our experiments, we compared the results obtained by our hierarchical clustering on each individual feature with those predicted by InteHC on the combined matrix. We observed that integration of heterogeneous data significantly benefits the identification of protein complexes. Moreover, a comprehensive comparison demonstrates that InteHC performs much better than 14 state-of-the-art approaches. All the experimental data and results can be downloaded from http://www.ntu.edu.sg/home/zhengjie/data/InteHC.
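The three steps in the abstract can be illustrated with a minimal sketch on toy data. This is not the authors' implementation. The abstract does not specify the linkage criterion or stopping rule, so average linkage with a score threshold is assumed here. The source weights are fixed by hand rather than learned with a linear ranking SVM, and the names `combine_scores`, `average_linkage_clusters`, `ppi`, `coexpr`, and `threshold` are all hypothetical.

```python
from itertools import combinations


def combine_scores(matrices, weights):
    """Weighted sum of per-source affinity matrices (lists of lists)."""
    n = len(matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for matrix, w in zip(matrices, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * matrix[i][j]
    return combined


def average_linkage_clusters(scores, threshold):
    """Agglomerative clustering on a similarity matrix.

    Repeatedly merges the two clusters with the highest average
    pairwise score and stops once that score drops below `threshold`.
    (Assumed criterion; the abstract does not state the one used.)
    """
    clusters = [[i] for i in range(len(scores))]

    def avg_score(a, b):
        return sum(scores[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_score(clusters[p[0]], clusters[p[1]]),
        )
        if avg_score(clusters[x], clusters[y]) < threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Step 1: one affinity matrix per source (toy values for 4 proteins).
ppi = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]
coexpr = [[0.0, 0.8, 0.1, 0.2],
          [0.8, 0.0, 0.2, 0.1],
          [0.1, 0.2, 0.0, 0.9],
          [0.2, 0.1, 0.9, 0.0]]

# Step 2: weighted sum. These weights are made up for illustration;
# in the paper they are learned by a linear ranking SVM.
weights = [0.6, 0.4]
final = combine_scores([ppi, coexpr], weights)

# Step 3: cluster the final score matrix into predicted complexes.
complexes = average_linkage_clusters(final, threshold=0.5)
print(sorted(complexes))  # [[0, 1], [2, 3]]
```

On this toy input, proteins 0 and 1 form one predicted complex and proteins 2 and 3 another, because both sources agree on those pairs and the cross-pair scores stay well below the threshold.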