Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 106, Taiwan
BMC Bioinformatics. 2010 Apr 2;11:167. doi: 10.1186/1471-2105-11-167.
Elucidating protein-protein interactions (PPIs) is essential to constructing protein interaction networks and to understanding the general principles of biological systems. Previous studies have shown that interacting protein pairs can be predicted from their primary structure. Most of these approaches have achieved satisfactory performance on datasets comprising equal numbers of interacting and non-interacting protein pairs. In nature, however, this ratio is highly unbalanced, and these techniques have not been comprehensively evaluated for the effect of the large number of non-interacting pairs found in realistic datasets. Moreover, since highly unbalanced distributions usually lead to large datasets, more efficient predictors are needed to handle such challenging tasks.
This study presents a method for PPI prediction based only on sequence information and makes three contributions. First, we propose a probability-based mechanism for transforming protein sequences into feature vectors. Second, the proposed predictor is built on an efficient classification algorithm, since efficiency is essential for handling highly unbalanced datasets. Third, the proposed PPI predictor is assessed on several unbalanced datasets with different positive-to-negative ratios (from 1:1 to 1:15). This analysis provides solid evidence that the degree of dataset imbalance is an important factor for PPI predictors.
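To make the effect of the positive-to-negative ratio concrete, the following minimal Python sketch builds toy datasets at several ratios and reports the accuracy of a trivial predictor that labels every pair as non-interacting. It is an illustration only and not the procedure used in the study: the abstract does not say how the unbalanced datasets were assembled, so random subsampling of negatives, the function names `subsample_negatives` and `all_negative_accuracy`, the toy protein identifiers, and the intermediate ratios 1:3 and 1:7 are all assumptions.

```python
import random


def subsample_negatives(positives, negatives, neg_per_pos, seed=0):
    """Build a labeled dataset with a 1:neg_per_pos positive-to-negative ratio.

    Assumption: negatives are drawn uniformly at random without replacement.
    """
    n_neg = len(positives) * neg_per_pos
    if n_neg > len(negatives):
        raise ValueError("not enough negative pairs for the requested ratio")
    rng = random.Random(seed)
    sampled = rng.sample(negatives, n_neg)
    return [(pair, 1) for pair in positives] + [(pair, 0) for pair in sampled]


def all_negative_accuracy(dataset):
    """Accuracy of a trivial predictor that calls every pair non-interacting."""
    n_negative = sum(1 for _, label in dataset if label == 0)
    return n_negative / len(dataset)


# Hypothetical toy pairs: 10 interacting and 150 non-interacting.
positives = [(f"P{i}", f"Q{i}") for i in range(10)]
negatives = [(f"P{i}", f"R{j}") for i in range(10) for j in range(15)]

for k in (1, 3, 7, 15):
    data = subsample_negatives(positives, negatives, k)
    acc = all_negative_accuracy(data)
    print(f"1:{k}  pairs={len(data)}  all-negative accuracy={acc:.4f}")
```

At 1:1 the trivial predictor is right half of the time, but at 1:15 it is right for 15 of every 16 pairs, and the dataset is eight times larger. Plain accuracy therefore says little on unbalanced data, and predictor efficiency matters more as the ratio grows.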
Dealing with data imbalance is a key issue in PPI prediction since there are far fewer interacting protein pairs than non-interacting ones. This article provides a comprehensive study on this issue and develops a practical tool that achieves both good prediction performance and efficiency using only protein sequence information.