Liu Hui, Sun Jianjiang, Guan Jihong, Zheng Jie, Zhou Shuigeng
Lab of Information Management, Changzhou University, Jiangsu 213164, China; School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore; Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200433, China; and Department of Computer Science and Technology, Tongji University, Shanghai 201804, China.
Bioinformatics. 2015 Jun 15;31(12):i221-9. doi: 10.1093/bioinformatics/btv256.
Computational prediction of compound-protein interactions (CPIs) is of great importance for drug design and development, as genome-scale experimental validation of CPIs is not only time-consuming but also prohibitively expensive. Although an increasing number of validated interactions is available, the performance of computational prediction approaches is severely impeded by the lack of reliable negative CPI samples. A systematic method for screening reliable negative samples is therefore critical to improving the performance of in silico prediction methods.
This article aims to build a set of highly credible negative CPI samples via an in silico screening method. Most existing computational models assume that similar compounds are likely to interact with similar target proteins, and they achieve remarkable performance. It is therefore rational to identify potential negative samples from the converse proposition: a protein dissimilar to every known or predicted target of a compound is unlikely to be targeted by that compound, and a compound dissimilar to every known or predicted ligand of a protein is unlikely to target that protein. We integrated various resources into a systematic screening framework: for compounds, chemical structures, chemical expression profiles and side effects; for proteins, amino acid sequences, the protein-protein interaction network and functional annotations. We first tested the screened negative samples on six classical classifiers, and all of them achieved remarkably higher performance on our negative samples than on randomly generated negative samples, for both human and Caenorhabditis elegans. We then verified the negative samples on three existing prediction models (bipartite local model, Gaussian kernel profile and Bayesian matrix factorization) and found that the performance of these models also improved significantly on the screened negative samples. Moreover, we validated the screened negative samples on a drug bioactivity dataset. Finally, we derived two sets of new interactions by training a support vector machine classifier on the positive interactions annotated in DrugBank and our screened negative interactions. The screened negative samples and the predicted interactions provide the research community with a useful resource for identifying new drug targets and a helpful supplement to the current curated compound-protein databases.
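The screening idea can be illustrated with a minimal Python sketch. It is not the authors' scoring scheme: the function name `screen_negatives`, the single cutoff `max_sim`, and the use of one precomputed similarity table per entity type are all assumptions made for illustration. The abstract describes combining several compound and protein resources, which is collapsed here into one similarity value per pair.

```python
from itertools import product


def screen_negatives(known, comp_sim, prot_sim, compounds, proteins, max_sim=0.3):
    """Return unlabeled (compound, protein) pairs dissimilar to all known interactions.

    known:     set of (compound, protein) positive pairs.
    comp_sim:  dict mapping frozenset({a, b}) -> compound similarity in [0, 1].
    prot_sim:  dict mapping frozenset({a, b}) -> protein similarity in [0, 1].
    max_sim:   hypothetical cutoff; a pair is kept only if both scores fall below it.
    """
    targets = {c: set() for c in compounds}  # known targets of each compound
    ligands = {p: set() for p in proteins}   # known ligands of each protein
    for c, p in known:
        targets[c].add(p)
        ligands[p].add(c)

    def sim(table, a, b):
        # Assumption: a pair missing from the table is treated as fully dissimilar.
        return 1.0 if a == b else table.get(frozenset((a, b)), 0.0)

    negatives = []
    for c, p in product(compounds, proteins):
        if (c, p) in known:
            continue
        # Highest similarity between p and any known target of c
        # (0.0 when c has no known target, which is a permissive simplification).
        prot_score = max((sim(prot_sim, p, q) for q in targets[c]), default=0.0)
        # Highest similarity between c and any known ligand of p.
        comp_score = max((sim(comp_sim, c, d) for d in ligands[p]), default=0.0)
        if prot_score < max_sim and comp_score < max_sim:
            negatives.append((c, p))
    return negatives


# Toy data, invented purely for illustration.
compounds = ["c1", "c2"]
proteins = ["p1", "p2", "p3"]
known = {("c1", "p1"), ("c2", "p3")}
prot_sim = {
    frozenset(("p1", "p2")): 0.8,
    frozenset(("p1", "p3")): 0.1,
    frozenset(("p2", "p3")): 0.2,
}
comp_sim = {frozenset(("c1", "c2")): 0.1}

print(screen_negatives(known, comp_sim, prot_sim, compounds, proteins))
```

In this toy example the pair (c1, p2) is rejected as a negative candidate because p2 is highly similar to p1, a known target of c1. A stricter variant might also skip compounds or proteins that have no known partners at all, since there is nothing to compare them against.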
Supplementary files are available at: http://admis.fudan.edu.cn/negative-cpi/.