Zhou Jie
Guangdong Province Key Laboratory of Computer Network, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China.
J Comput Biol. 2017 Feb;24(2):183-192. doi: 10.1089/cmb.2015.0233. Epub 2016 Aug 16.
Many computational approaches that use support vector machines (SVMs) to predict protein-protein interactions report high performance. In fact, the performance of currently reported methods is significantly overestimated and is affected by object repetitiveness in the datasets used.
The objective is to study the effect of object repetitiveness in datasets on prediction results.
We present novel methods that use graph maximum matching on protein-protein interaction datasets to construct positive datasets with or without repeated proteins. A corresponding series of negative datasets with different degrees of protein repetitiveness is constructed using the graph adjacency matrix. We analyze the relationship between the SVM prediction results and the repeated proteins (repeat numbers and repeat rates), as well as the distribution of repeated proteins in the datasets.
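The dataset construction idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy interaction list and all function names are hypothetical, a simple greedy maximal matching stands in for a true maximum matching (it guarantees that no protein repeats, but not that the matching is as large as possible), and "repeat number" is assumed here to mean the count of proteins that occur in more than one pair.

```python
from itertools import combinations


def greedy_matching(pairs):
    """Select pairs so that no protein is used twice (greedy maximal matching)."""
    used, matching = set(), []
    for a, b in pairs:
        if a != b and a not in used and b not in used:
            matching.append((a, b))
            used.update((a, b))
    return matching


def adjacency_matrix(proteins, edges):
    """Symmetric 0/1 matrix: entry is 1 where two proteins are known to interact."""
    index = {p: i for i, p in enumerate(proteins)}
    adj = [[0] * len(proteins) for _ in proteins]
    for a, b in edges:
        adj[index[a]][index[b]] = 1
        adj[index[b]][index[a]] = 1
    return adj


def non_interacting_pairs(proteins, adj):
    """Candidate negative pairs: off-diagonal zero entries of the adjacency matrix."""
    n = len(proteins)
    return [(proteins[i], proteins[j])
            for i, j in combinations(range(n), 2) if adj[i][j] == 0]


def repeat_number(pairs):
    """Assumed definition: number of proteins that appear in more than one pair."""
    counts = {}
    for a, b in pairs:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return sum(1 for c in counts.values() if c > 1)


# Hypothetical toy interaction dataset (proteins named A..E).
interactions = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
proteins = sorted({p for pair in interactions for p in pair})

# Positive dataset without repeated proteins.
positives = greedy_matching(interactions)

# Negative candidates from the adjacency matrix, then a non-repeating subset.
adj = adjacency_matrix(proteins, interactions)
negative_candidates = non_interacting_pairs(proteins, adj)
negatives = greedy_matching(negative_candidates)

print(positives, repeat_number(positives))
print(negatives, repeat_number(negatives))
print(repeat_number(interactions))
```

Using the full interaction list as the positive set, or all candidate negatives instead of the matched subset, gives datasets with repeated proteins, so the two extremes can be compared on the same data.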
Protein repetitiveness in the positive and negative datasets affects the prediction results: high protein repetitiveness in either the positive or the negative dataset yields high prediction performance.
This indicates that dealing with object repetitiveness in datasets is a key issue in SVM-based prediction of protein-protein interactions, since real-world data contain a certain degree of repeated proteins.