Leiden Academic Centre of Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands.
Galapagos NV, Generaal De Wittelaan L11 A3, 2800 Mechelen, Belgium.
J Chem Inf Model. 2023 Jun 26;63(12):3688-3696. doi: 10.1021/acs.jcim.3c00132. Epub 2023 Jun 9.
Protein kinases are a protein family that plays an important role in several complex diseases, such as cancer and cardiovascular and immunological diseases. Protein kinases have conserved ATP binding sites, so inhibitors that target these sites can show similar activities against different kinases. This can be exploited to create multitarget drugs. On the other hand, selectivity (a lack of similar activities) is desirable to avoid toxicity issues. A vast amount of protein kinase activity data is available in the public domain and can be used in many different ways. Multitask machine learning models are expected to excel on these kinds of data sets because they can learn from implicit correlations between tasks (in this case, activities against a variety of kinases). However, multitask modeling of sparse data poses two major challenges: (i) creating a balanced train-test split without data leakage and (ii) handling missing data. In this work, we construct a protein kinase benchmark set composed of two balanced splits without data leakage, one built with a random mechanism and the other with a dissimilarity-driven cluster-based mechanism. This data set can be used for benchmarking and developing protein kinase activity prediction models. Overall, performance on the dissimilarity-driven cluster-based split is lower than on the random split for all models, indicating poor generalizability of the models. Nevertheless, we show that multitask deep learning models, on this very sparse data set, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multitask) models on this benchmark set.
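The two challenges named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `masked_mse` and `cluster_split`, the use of NaN to mark unmeasured compound-kinase pairs, and the greedy assignment of whole clusters to the test set are all assumptions made for illustration. In particular, the greedy split only prevents cluster-level leakage; it does not balance the split per kinase as the benchmark described above does.

```python
import numpy as np


def masked_mse(y_true, y_pred):
    """Mean squared error over observed entries only.

    Assumption: y_true is a (compounds x kinases) matrix in which
    unmeasured activities are stored as NaN and must not contribute
    to the loss.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    observed = ~np.isnan(y_true)
    if not observed.any():
        raise ValueError("y_true contains no observed labels")
    diff = y_pred[observed] - y_true[observed]
    return float(np.mean(diff ** 2))


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Return a boolean test mask that keeps each cluster on one side.

    Hypothetical greedy scheme: shuffle the clusters, then move whole
    clusters into the test set until roughly test_fraction of the
    compounds has been reached. No per-task balancing is attempted.
    """
    cluster_ids = np.asarray(cluster_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(cluster_ids))
    target = test_fraction * len(cluster_ids)
    test_mask = np.zeros(len(cluster_ids), dtype=bool)
    for cluster in shuffled:
        if test_mask.sum() >= target:
            break
        test_mask |= cluster_ids == cluster
    return test_mask
```

Because compounds from one cluster never appear on both sides, structurally similar molecules cannot leak from training into testing, which is the property a cluster-based split is meant to provide. The masked loss lets a multitask model train on the sparse matrix directly, without imputing the missing activities.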