Large-Scale Modeling of Sparse Protein Kinase Activity Data.

Affiliations

Leiden Academic Centre of Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands.

Galapagos NV, Generaal De Wittelaan L11 A3, 2800 Mechelen, Belgium.

Publication information

J Chem Inf Model. 2023 Jun 26;63(12):3688-3696. doi: 10.1021/acs.jcim.3c00132. Epub 2023 Jun 9.

DOI: 10.1021/acs.jcim.3c00132

PMID: 37294674

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10302492/

Abstract

Protein kinases are a protein family that plays an important role in several complex diseases such as cancer and cardiovascular and immunological diseases. Protein kinases have conserved ATP binding sites, which when targeted can lead to similar activities of inhibitors against different kinases. This can be exploited to create multitarget drugs. On the other hand, selectivity (lack of similar activities) is desirable in order to avoid toxicity issues. There is a vast amount of protein kinase activity data in the public domain, which can be used in many different ways. Multitask machine learning models are expected to excel for these kinds of data sets because they can learn from implicit correlations between tasks (in this case activities against a variety of kinases). However, multitask modeling of sparse data poses two major challenges: (i) creating a balanced train-test split without data leakage and (ii) handling missing data. In this work, we construct a protein kinase benchmark set composed of two balanced splits without data leakage, using random and dissimilarity-driven cluster-based mechanisms, respectively. This data set can be used for benchmarking and developing protein kinase activity prediction models. Overall, the performance on the dissimilarity-driven cluster-based split is lower than on random split-based sets for all models, indicating poor generalizability of models. Nevertheless, we show that multitask deep learning models, on this very sparse data set, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multitask) models on this benchmark set.
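The two challenges named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: it assumes compounds have already been clustered by some dissimilarity measure (the `cluster_ids` input is hypothetical), it keeps whole clusters on one side of the split so that no compound leaks between train and test, and it scores predictions only on measured compound-kinase pairs, with NaN marking a missing activity. It does not attempt the per-kinase balancing that the benchmark set provides, and the function names are illustrative.

```python
import math
import random
from collections import defaultdict


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Assign whole clusters to the test set until it holds about test_fraction of compounds.

    cluster_ids[i] is the (hypothetical, precomputed) cluster label of compound i.
    A cluster is never divided, so no compound and none of its cluster
    neighbours end up on both sides of the split.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    clusters = sorted(members)  # deterministic order before shuffling
    random.Random(seed).shuffle(clusters)
    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cid in clusters:
        # Fill the test set first; once the target is reached, the rest is train.
        (test if len(test) < target else train).extend(members[cid])
    return sorted(train), sorted(test)


def masked_mse(y_true, y_pred):
    """Mean squared error over measured entries only; NaN marks a missing label."""
    total, count = 0.0, 0
    for row_true, row_pred in zip(y_true, y_pred):
        for t, p in zip(row_true, row_pred):
            if not math.isnan(t):
                total += (t - p) ** 2
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Ten compounds in five clusters of two; rows are compounds, columns are kinases.
    train_idx, test_idx = cluster_split([0, 0, 1, 1, 2, 2, 3, 3, 4, 4], test_fraction=0.2)
    print(train_idx, test_idx)
    nan = math.nan
    print(masked_mse([[6.0, nan], [nan, 8.0]], [[5.0, 3.0], [9.0, 8.0]]))
```

Masking is one common way to train or evaluate a multitask model on a sparse activity matrix without filling in the missing entries. Imputation is the alternative that the abstract reports as giving no improvement on this benchmark.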
