Balancing Data on Deep Learning-Based Proteochemometric Activity Classification.

Affiliations

B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028 Barcelona, Spain.

Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, 08950 Esplugues de Llobregat, Spain.

Publication information

J Chem Inf Model. 2021 Apr 26;61(4):1657-1669. doi: 10.1021/acs.jcim.1c00086. Epub 2021 Mar 29.

DOI:10.1021/acs.jcim.1c00086
PMID:33779173
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8594867/
Abstract

In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand-target activity prediction models. However, bioactivity data sets used in proteochemometric modeling are usually imbalanced, which could potentially affect the performance of the models. In this work, we explored the effect of different balancing strategies in deep learning proteochemometric target-compound activity classification models while controlling for the compound series bias through clustering. These strategies were (1) no_resampling, (2) resampling_after_clustering, (3) resampling_before_clustering, and (4) semi_resampling. These schemas were evaluated in kinases, GPCRs, nuclear receptors, and proteases from BindingDB. We observed that the predicted proportion of positives was driven by the actual data balance in the test set. Additionally, it was confirmed that data balance had an impact on the performance estimates of the proteochemometric model. We recommend a combination of data augmentation and clustering in the training set (semi_resampling) to mitigate the data imbalance effect in a realistic scenario. The code of this analysis is publicly available at https://github.com/b2slab/imbalance_pcm_benchmark.
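As a rough illustration of the idea behind the recommended strategy, the sketch below keeps whole compound clusters on one side of the train/test split and then balances only the training set by random oversampling, leaving the test set at its natural class ratio. This is one reading of the abstract's description of semi_resampling, not the authors' implementation (their code is in the repository linked above). The record fields (`compound`, `cluster`, `active`), the function names, the toy data, and the use of simple random oversampling are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_cluster(records, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so no cluster spans both sets."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["cluster"]].append(rec)
    cluster_ids = sorted(clusters)
    rng.shuffle(cluster_ids)
    n_test = max(1, round(len(cluster_ids) * test_fraction))
    test = [r for cid in cluster_ids[:n_test] for r in clusters[cid]]
    train = [r for cid in cluster_ids[n_test:] for r in clusters[cid]]
    return train, test


def oversample_minority(records, seed=0):
    """Randomly duplicate minority-class records until both classes match in size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["active"]].append(rec)
    if len(by_label) < 2:
        return list(records)  # only one class present, nothing to balance
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Hypothetical toy data: 10 clusters of 5 compounds, one active per cluster.
records = [
    {"compound": f"c{c}_{i}", "cluster": c, "active": i == 0}
    for c in range(10)
    for i in range(5)
]

train, test = split_by_cluster(records, test_fraction=0.2, seed=0)
train_balanced = oversample_minority(train, seed=0)

print("train actives/inactives:",
      sum(r["active"] for r in train_balanced),
      sum(not r["active"] for r in train_balanced))
print("test actives/inactives:",
      sum(r["active"] for r in test),
      sum(not r["active"] for r in test))
```

Resampling only after the cluster-level split means duplicated training records cannot leak into the test set, and the untouched test set keeps performance estimates closer to the imbalance a model would face in practice.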

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2857/8594867/182e2c956bc4/ci1c00086_0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2857/8594867/78b077609127/ci1c00086_0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2857/8594867/17274da03256/ci1c00086_0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2857/8594867/5e51a93df8c9/ci1c00086_0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2857/8594867/25f709f9de9d/ci1c00086_0006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2857/8594867/60b0c3023342/ci1c00086_0007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2857/8594867/5de4a62d1be5/ci1c00086_0008.jpg
