Li Minjie, Wu Ziheng, Wang Wenyan, Lu Kun, Zhang Jun, Zhou Yuming, Chen Zhaoquan, Li Dan, Zheng Shicheng, Chen Peng, Wang Bing
IEEE/ACM Trans Comput Biol Bioinform. 2022 Nov-Dec;19(6):3646-3654. doi: 10.1109/TCBB.2021.3123269. Epub 2022 Dec 8.
Computational methods for predicting protein-protein interaction sites can avoid the high cost and long duration of traditional experimental approaches. However, the severe class imbalance between interface and non-interface residues in protein sequences limits the prediction performance of these methods. This work therefore proposes a new strategy for predicting protein interaction sites, NM-RF, which combines NearMiss-based under-sampling of imbalanced datasets with Random Forest classification. Each residue in a protein sequence is represented by PSSM-derived features, the hydropathy index (HI), and relative solvent accessibility (RSA). To address the class imbalance, an under-sampling method based on the NearMiss algorithm removes some of the non-interface residues, and the random forest algorithm then performs binary classification on the balanced feature dataset. Experiments show that the accuracy of the NM-RF model reaches 87.6% on Dtestset72 and 84.3% on PDBtestset164, which demonstrates the effectiveness of the proposed method in distinguishing interface from non-interface residues.
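The under-sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which NearMiss variant or neighbour count the authors used, so the code below assumes the NearMiss-1 rule (keep the majority-class samples whose mean distance to their k nearest minority-class samples is smallest) and balances the classes one to one. The function name `near_miss_1`, the parameter `k`, and the two-dimensional toy features are all hypothetical; real inputs would be the PSSM, HI, and RSA feature vectors.

```python
import math


def near_miss_1(X, y, majority=0, minority=1, k=3):
    """Under-sample the majority class with a NearMiss-1 style rule.

    Assumption: this is an illustrative variant, not necessarily the
    one used in the paper. Keeps as many majority samples as there are
    minority samples, choosing those closest (on average) to their k
    nearest minority neighbours.
    """
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label == minority]
    if not mino:
        raise ValueError("no minority-class samples to balance against")
    k = min(k, len(mino))

    def mean_dist_to_nearest_minority(i):
        # Euclidean distances from majority sample i to every minority sample
        dists = sorted(math.dist(X[i], X[j]) for j in mino)
        return sum(dists[:k]) / k

    # Majority samples with the smallest mean distance are retained
    kept_majority = sorted(maj, key=mean_dist_to_nearest_minority)[:len(mino)]
    keep = sorted(kept_majority + mino)  # preserve original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: label 1 = interface residue (minority), 0 = non-interface
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [6.0, 6.0], [0.2, 0.1], [9.0, 9.0]]
y = [1, 1, 0, 0, 0, 0]

X_bal, y_bal = near_miss_1(X, y, k=2)
print(y_bal)  # two samples of each class remain
```

The balanced `X_bal` and `y_bal` could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`. The abstract does not report hyperparameters, so any settings chosen there would be assumptions as well.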