Institute of Intelligent Machines, Chinese Academy of Sciences, PO Box 1130, Hefei 230031, China.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Jul-Aug;9(4):1155-65. doi: 10.1109/TCBB.2012.58.
Sequence-based understanding and identification of protein binding interfaces is a challenging research topic because of the complexity of protein systems and the imbalanced distribution of interface and noninterface residues. This paper presents an outlier detection idea to address the redundancy problem in protein interaction data. The cleaned training data are then used to improve prediction performance. We use three novel measures to describe the extent to which a residue is considered an outlier in comparison to the other residues: the distance of a residue instance from the center instance of all residue instances of the same class label (Dist), the probability of the class label of the residue instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are computed by integrating the three factors; instances with a sufficiently large score are treated as outliers and removed. The data sets without outliers are taken as input for a support vector machine (SVM) ensemble. The proposed SVM ensemble performs better when trained on input data without outliers than when trained on data with outliers. Our method is also more accurate than many methods from the literature on benchmark data sets. From our empirical studies, we found that some outlier interface residues are in fact close to noninterface regions, and some outlier noninterface residues are close to interface regions.
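The outlier-scoring step described in the abstract can be illustrated with a small sketch. The abstract names the three factors (Dist, PCL, IWB) but does not give their formulas or how they are integrated, so everything below is an assumption made for illustration: Dist is taken as the Euclidean distance to the class centroid scaled by the largest such distance, PCL is approximated by the share of the `k` nearest neighbours carrying the same label, IWB is approximated by comparing the nearest same-class distance with the nearest other-class distance, and the three terms are simply averaged. The function name `outlier_scores`, the parameter `k`, and the threshold value are hypothetical. The sketch also assumes at least two classes with at least two instances each.

```python
import numpy as np

def outlier_scores(X, y, k=3):
    """Illustrative outlier score in [0, 1]; all three factor definitions are assumptions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)

    # Pairwise Euclidean distances; the diagonal is set to inf so an
    # instance is never counted as its own neighbour.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)

    # Dist (assumed form): distance to the centroid of the instance's own
    # class, scaled by the largest such distance over all instances.
    dist = np.empty(n)
    for label in np.unique(y):
        mask = y == label
        center = X[mask].mean(axis=0)
        dist[mask] = np.linalg.norm(X[mask] - center, axis=1)
    dist = dist / (dist.max() + 1e-12)

    # PCL (assumed form): fraction of the k nearest neighbours that share
    # the instance's label; a low value suggests an unlikely label.
    nn = np.argsort(d, axis=1)[:, :k]
    pcl = (y[nn] == y[:, None]).mean(axis=1)

    # IWB (assumed form): nearest same-class distance relative to the
    # nearest other-class distance; close to 1 when the instance sits
    # far from its own class and near the other class.
    same = y[:, None] == y[None, :]
    within = np.where(same, d, np.inf).min(axis=1)
    between = np.where(~same, d, np.inf).min(axis=1)
    iwb = within / (within + between + 1e-12)

    # Integration (assumed form): unweighted mean of the three terms.
    return (dist + (1.0 - pcl) + iwb) / 3.0

# Toy data: two well-separated clusters, plus one instance labelled 0
# that lies in the middle of the class-1 cluster.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11],
     [10.5, 10.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 0]

scores = outlier_scores(X, y, k=3)
threshold = 0.5                      # hypothetical cut-off
keep = scores < threshold            # instances retained for training
X_clean = np.asarray(X)[keep]
y_clean = np.asarray(y)[keep]
```

In this toy setting the mislabelled-looking instance receives the largest score and is the only one removed. The cleaned arrays would then be passed to whatever classifier is trained downstream; in the paper that is an SVM ensemble, which this sketch does not attempt to reproduce.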