Guney Emre
Joint IRB-BSC-CRG Program in Computational Biology, Institute for Research in Biomedicine, c/ Baldiri Reixac 10-12, Barcelona, 08028, Spain
Pac Symp Biocomput. 2017;22:132-143. doi: 10.1142/9789813207813_0014.
Repurposing existing drugs for new uses has attracted considerable attention in recent years. To identify candidates that could be repositioned for a new indication, many studies use chemical, target, and side effect similarity between drugs to train classifiers. Despite the promising prediction accuracies of these supervised computational models, their use in practice, such as for rare diseases, is hindered by the assumption that known, similar drugs already exist for the condition of interest. In this study, using publicly available data sets, we question the prediction accuracies of supervised approaches based on drug similarity when the drugs in the training and test sets are completely disjoint. We first build a Python platform to generate reproducible similarity-based drug repurposing models. Next, we show that, while a simple machine learning method based on chemical, target, and side effect similarity can achieve good performance on the benchmark data set, the prediction performance drops sharply when the drugs in the cross-validation folds do not overlap and the similarity information within the training and test sets is used independently. These results suggest revisiting the assumptions underlying the validation scenarios of similarity-based methods and underline the need for unsupervised approaches to identify novel drug uses in unexplored pharmacological space. The digital notebook containing the Python code to replicate our analysis, including the machine learning based drug repurposing platform and the proposed disjoint cross-validation fold generation method, is freely available at github.com/emreg00/repurpose.
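The idea of cross-validation folds with non-overlapping drugs, as described in the abstract, can be illustrated with a minimal sketch. This is not the code from the repository: the function names `disjoint_folds` and `train_test_splits`, the round-robin assignment, and the placeholder drug and disease labels are all assumptions made for illustration. The sketch covers only the fold assignment and does not compute similarity features separately within the training and test sets.

```python
import random


def disjoint_folds(pairs, n_folds=5, seed=0):
    """Split (drug, disease) pairs into folds so that each drug
    appears in exactly one fold (hypothetical helper)."""
    # Collect the unique drugs and shuffle them reproducibly.
    drugs = sorted({drug for drug, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(drugs)
    # Assign whole drugs, not individual pairs, to folds round-robin.
    fold_of = {drug: i % n_folds for i, drug in enumerate(drugs)}
    folds = [[] for _ in range(n_folds)]
    for drug, disease in pairs:
        folds[fold_of[drug]].append((drug, disease))
    return folds


def train_test_splits(folds):
    """Yield (train, test) lists, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


# Placeholder data: labels are illustrative, not taken from the benchmark.
example_pairs = [
    ("drug_A", "disease_1"), ("drug_A", "disease_2"),
    ("drug_B", "disease_1"), ("drug_C", "disease_3"),
    ("drug_D", "disease_2"), ("drug_D", "disease_3"),
]

for train, test in train_test_splits(disjoint_folds(example_pairs, n_folds=3)):
    train_drugs = {drug for drug, _ in train}
    test_drugs = {drug for drug, _ in test}
    print(sorted(test_drugs), "shared drugs:", len(train_drugs & test_drugs))
```

Because whole drugs are assigned to folds, a classifier evaluated this way never sees a test drug during training, unlike a split over individual drug-disease pairs, where the same drug can fall on both sides.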