Ezzat Ali, Wu Min, Li Xiao-Li, Kwoh Chee-Keong
School of Computer Science & Engineering, Nanyang Technological University, Nanyang Ave., Singapore, 639798, Singapore.
Institute for Infocomm Research (I2R), A*Star, Fusionopolis Way, Singapore, 138632, Singapore.
BMC Bioinformatics. 2016 Dec 22;17(Suppl 19):509. doi: 10.1186/s12859-016-1377-y.
Multiple computational methods for predicting drug-target interactions have been developed to facilitate the drug discovery process. These methods use available data on known drug-target interactions to train classifiers that predict new, undiscovered interactions. However, a key challenge in this data, namely class imbalance, has not yet been addressed by these methods and may be degrading prediction performance. Class imbalance can be divided into two sub-problems. Firstly, the number of known interacting drug-target pairs is much smaller than the number of non-interacting drug-target pairs. This imbalance between interacting and non-interacting drug-target pairs is referred to as the between-class imbalance. Between-class imbalance degrades prediction performance because prediction results are biased towards the majority class (i.e. the non-interacting pairs), leading to more prediction errors in the minority class (i.e. the interacting pairs). Secondly, the data contain multiple types of drug-target interactions, and some types have relatively fewer members (i.e. are less represented) than others. This variation in the representation of the different interaction types leads to another kind of imbalance, referred to as the within-class imbalance. In within-class imbalance, prediction results are biased towards the better represented interaction types, leading to more prediction errors in the less represented interaction types.
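To make the two kinds of imbalance concrete, the following minimal sketch computes a between-class ratio (non-interacting to interacting pairs) and a within-class ratio (largest to smallest interaction type). The function name `imbalance_summary`, the type labels "A" and "B", and the toy counts are all hypothetical and chosen only for illustration; they are not taken from the paper's data.

```python
from collections import Counter


def imbalance_summary(labels, interaction_types):
    """Summarise class imbalance for a list of drug-target pairs.

    labels: 1 for an interacting pair, 0 for a non-interacting pair.
    interaction_types: a type label per pair (None for non-interacting pairs).
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # Between-class imbalance: how many negatives exist per positive.
    between = n_neg / n_pos if n_pos else float("inf")

    # Within-class imbalance: compare the sizes of the interaction types
    # found among the positive pairs only.
    type_counts = Counter(
        t for y, t in zip(labels, interaction_types) if y == 1
    )
    if type_counts:
        within = max(type_counts.values()) / min(type_counts.values())
    else:
        within = float("nan")
    return between, within, dict(type_counts)


# Toy data (illustrative only): 6 interacting pairs, 54 non-interacting pairs.
labels = [1] * 6 + [0] * 54
types = ["A"] * 5 + ["B"] * 1 + [None] * 54
between, within, counts = imbalance_summary(labels, types)
print(between, within, counts)
```

In this toy setting, a classifier that always predicts "non-interacting" would already be right for 54 of 60 pairs, and type "B" contributes a single training example, which is the situation the two sub-problems describe.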
We propose an ensemble learning method that incorporates techniques to address both between-class imbalance and within-class imbalance. Experiments show that the proposed method improves on the results of four state-of-the-art methods. In addition, we simulated cases involving new drugs and targets, i.e. those for which no prior interactions are known, to see how our method would perform in predicting their interactions. Our method displayed satisfactory prediction performance and was able to predict many of these interactions successfully.
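The abstract does not state which techniques the ensemble uses, so the sketch below is only one plausible way to combine the two ideas, under stated assumptions: interacting pairs are already grouped by interaction type, small groups are oversampled by duplication (within-class), and each base learner receives a different random undersample of the non-interacting pairs (between-class). The function `balanced_training_sets` and all identifiers are hypothetical, and each group is assumed to be non-empty.

```python
import random


def balanced_training_sets(positives_by_group, negatives, n_learners, seed=0):
    """Build one resampled training set per base learner (illustrative sketch).

    positives_by_group: dict mapping a group name to a non-empty list of
        interacting pairs (assumption: groups approximate interaction types).
    negatives: list of non-interacting pairs.
    """
    rng = random.Random(seed)

    # Within-class step: duplicate members of small groups until every
    # group matches the size of the largest one.
    target = max(len(members) for members in positives_by_group.values())
    positives = []
    for members in positives_by_group.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        positives.extend(members + extra)

    # Between-class step: each learner sees as many negatives as positives,
    # drawn independently so the ensemble covers more of the majority class.
    k = min(len(positives), len(negatives))
    training_sets = []
    for _ in range(n_learners):
        sampled_negatives = rng.sample(negatives, k)
        training_sets.append((list(positives), sampled_negatives))
    return training_sets


# Toy usage with made-up identifiers.
positives_by_group = {"g1": ["p1", "p2", "p3", "p4"], "g2": ["p5"]}
negatives = [f"n{i}" for i in range(50)]
sets = balanced_training_sets(positives_by_group, negatives, n_learners=3)
for pos, neg in sets:
    print(len(pos), len(neg))
```

Each returned pair could then be used to train one base classifier, with the ensemble's prediction taken as, for example, the average of the base classifiers' scores. Whether this matches the authors' design cannot be determined from the abstract alone.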
Our proposed method improved prediction performance over existing work, underscoring the importance of addressing class imbalance problems in the data.