Kumari Priyanka, Nath Abhigyan, Chaube Radha
Bioinformatics Section, Mahila Mahavidyalaya, Banaras Hindu University, Varanasi 221005, India.
Zoology/Bioinformatic Section, Mahila Mahavidyalaya, Banaras Hindu University, Varanasi 221005, India.
Comput Biol Med. 2015 Jan;56:175-81. doi: 10.1016/j.compbiomed.2014.11.008. Epub 2014 Nov 20.
Identifying potential drug targets is a crucial task in the drug-discovery pipeline. Identifying candidate drug targets across entire genomes is valuable, and computational prediction methods can speed up this process. In this work we developed a sequence-based prediction method for discriminating human drug target proteins from human non-drug target proteins. The predictive models are trained on sequence-based features: amino acid composition, amino acid property group composition, and dipeptide composition. Classifying human drug target proteins is a classic example of class imbalance. We addressed this by using SMOTE (Synthetic Minority Over-sampling Technique) as a preprocessing step to balance the training data to a 1:1 ratio between drug targets (minority samples) and non-drug targets (majority samples). Using the Rotation Forest ensemble classifier together with ReliefF feature selection to choose the optimal subset of salient features, the best model achieved 87.1% sensitivity, 83.6% specificity, and 85.3% accuracy, with a Matthews correlation coefficient (MCC) of 0.71, in a stratified tenfold cross-validation test. The identified subset of optimal features may help in assessing the compositional patterns of human drug targets. In a further, more rigorous leave-one-out cross-validation test, the model achieved 88.1% sensitivity, 83.0% specificity, 85.5% accuracy, and an MCC of 0.712. The proposed method was also tested on a second dataset, on which the pipeline gave promising results. We suggest that this approach can serve as a complementary tool to existing methods for novel drug target prediction.
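As an illustration of two of the sequence-based features named in the abstract, the following is a minimal sketch of amino acid composition and dipeptide composition. It is not the authors' code: the function names, the restriction to the 20 standard residues, and the choice to skip non-standard characters are all assumptions made here. The amino acid property groups are not defined in the abstract, so that feature is left out.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids are counted;
# any other character (X, B, Z, gaps, ...) is ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return the fraction of each standard residue in seq (20 values)."""
    residues = [ch for ch in seq.upper() if ch in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Return the fraction of each adjacent residue pair in seq (400 values)."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Keep only pairs made of two standard residues.
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(valid)
    total = len(valid)
    if total == 0:
        return {dp: 0.0 for dp in DIPEPTIDES}
    return {dp: counts[dp] / total for dp in DIPEPTIDES}


def feature_vector(seq):
    """Concatenate both compositions into one fixed-length list (20 + 400)."""
    aac = amino_acid_composition(seq)
    dpc = dipeptide_composition(seq)
    return [aac[aa] for aa in AMINO_ACIDS] + [dpc[dp] for dp in DIPEPTIDES]
```

Each protein then maps to a 420-dimensional vector regardless of its length, which is the kind of fixed-size input that an oversampling step such as SMOTE and a classifier such as Rotation Forest require.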