Department of Genetics, University of Cambridge, Cambridge, UK.
STORM Therapeutics Ltd, Cambridge, UK.
Sci Rep. 2020 Jul 1;10(1):10787. doi: 10.1038/s41598-020-67846-1.
A major cause of failed drug discovery programs is suboptimal target selection, resulting in the development of drug candidates that are potent inhibitors but ineffective at treating the disease. In the genomics era, the availability of large biomedical datasets with genome-wide readouts has the potential to transform target selection and validation. In this study we investigate how computational intelligence methods can be applied to predict novel therapeutic targets in oncology. We compared different machine learning classifiers applied to the task of drug target classification for nine different human cancer types. For each cancer type, a set of "known" target genes was obtained, and equally sized sets of "non-targets" were sampled multiple times from the human protein-coding genes. Models were trained on mutation, gene expression (TCGA), and gene essentiality (DepMap) data. In addition, we generated a numerical embedding of the interaction network of protein-coding genes using deep network representation learning and included the results in the modeling. We assessed feature importance using a random forest classifier and performed feature selection by measuring permutation importance against a null distribution. Our best models achieved good generalization performance based on the AUROC metric. With the best model for each cancer type, we ran predictions on more than 15,000 protein-coding genes to identify potential novel targets. Our results indicate that this approach may be useful to inform early stages of the drug discovery pipeline.
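The sampling and evaluation scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sample_balanced_sets` and `auroc`, the gene identifiers, and the synthetic per-gene scores are all hypothetical, and the score stands in for the output of any trained classifier. The sketch draws equally sized "non-target" sets repeatedly from the genes that are not known targets, then computes AUROC as the probability that a randomly chosen target scores higher than a randomly chosen non-target.

```python
import random
from statistics import mean


def sample_balanced_sets(known_targets, all_genes, n_repeats, seed=0):
    """Yield (positives, negatives) pairs in which both classes have equal size.

    Negatives are drawn without replacement from genes that are not known
    targets; a fresh negative set is drawn on every repeat.
    """
    rng = random.Random(seed)
    positives = sorted(set(known_targets))
    candidates = sorted(set(all_genes) - set(positives))
    for _ in range(n_repeats):
        yield positives, rng.sample(candidates, len(positives))


def auroc(pos_scores, neg_scores):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical gene identifiers; the first 20 play the role of known targets.
    genes = [f"GENE{i}" for i in range(200)]
    targets = set(genes[:20])
    # Synthetic scores (assumption): targets are shifted upward on average.
    score = {g: rng.gauss(1.0 if g in targets else 0.0, 1.0) for g in genes}

    aurocs = [
        auroc([score[g] for g in pos], [score[g] for g in neg])
        for pos, neg in sample_balanced_sets(targets, genes, n_repeats=10)
    ]
    print(f"mean AUROC over 10 balanced samples: {mean(aurocs):.3f}")
```

Averaging the metric over several independently drawn negative sets reduces the dependence of the estimate on any single random choice of "non-targets", which is one plausible motivation for sampling them multiple times.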