B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028 Barcelona, Spain.
Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, 08950 Esplugues de Llobregat, Spain.
J Chem Inf Model. 2021 Apr 26;61(4):1657-1669. doi: 10.1021/acs.jcim.1c00086. Epub 2021 Mar 29.
In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, so-called proteochemometric models aim to share information between targets in machine learning models that predict ligand-target activity. However, the bioactivity data sets used in proteochemometric modeling are usually imbalanced, which can affect model performance. In this work, we explored the effect of different balancing strategies on deep learning proteochemometric target-compound activity classification models while controlling for compound series bias through clustering. These strategies were (1) no_resampling, (2) resampling_after_clustering, (3) resampling_before_clustering, and (4) semi_resampling. The strategies were evaluated on kinases, GPCRs, nuclear receptors, and proteases from BindingDB. We observed that the predicted proportion of positives was driven by the actual data balance in the test set. Additionally, we confirmed that data balance had an impact on the performance estimates of the proteochemometric model. We recommend a combination of data augmentation and clustering in the training set (semi_resampling) to mitigate the data imbalance effect in a realistic scenario. The code of this analysis is publicly available at https://github.com/b2slab/imbalance_pcm_benchmark.
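The general idea of balancing only the training data while keeping compound clusters intact across the split can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the four strategies in detail, so the helper names (split_by_cluster, oversample_minority), the toy records, the 25% test fraction, and the use of random oversampling as the balancing step are all hypothetical. The actual procedure is in the linked repository.

```python
import random
from collections import Counter


def split_by_cluster(records, test_fraction=0.25, seed=0):
    """Assign whole clusters to train or test so no cluster spans both.

    Hypothetical helper: keeping clusters intact is one common way to
    control for compound series bias.
    """
    rng = random.Random(seed)
    clusters = sorted({r["cluster"] for r in records})
    rng.shuffle(clusters)
    n_test = max(1, round(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train = [r for r in records if r["cluster"] not in test_clusters]
    test = [r for r in records if r["cluster"] in test_clusters]
    return train, test


def oversample_minority(records, seed=0):
    """Randomly duplicate rows of the smaller classes until all classes match.

    Hypothetical helper: random oversampling stands in here for whatever
    resampling or augmentation the real pipeline applies.
    """
    if not records:
        return []
    rng = random.Random(seed)
    by_label = {}
    for r in records:
        by_label.setdefault(r["active"], []).append(r)
    target = max(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rows)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced


# Toy data (invented): 8 clusters of 5 compounds, 1 active per cluster.
records = [
    {"compound": f"c{cl}_{i}", "cluster": cl, "active": int(i == 0)}
    for cl in range(8)
    for i in range(5)
]

train, test = split_by_cluster(records)
balanced_train = oversample_minority(train)

print("train before:", Counter(r["active"] for r in train))
print("train after: ", Counter(r["active"] for r in balanced_train))
print("test (as is):", Counter(r["active"] for r in test))
```

In this sketch the test set keeps its original class ratio, so performance estimates reflect the imbalance a model would face in practice, while only the training set is balanced.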