Exploration of chemical space with partial labeled noisy student self-training and self-supervised graph embedding.

Affiliations

Department of Computer Science, Hunter College, The City University of New York, 695 Park Ave, New York, NY, 10065, USA.

The Graduate Center, The City University of New York, 356 5th Ave, New York, NY, 10016, USA.

Publication information

BMC Bioinformatics. 2022 May 2;23(Suppl 3):158. doi: 10.1186/s12859-022-04681-3.

DOI:10.1186/s12859-022-04681-3

PMID:35501680

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9063120/

Abstract

BACKGROUND

Drug discovery is time-consuming and costly. Machine learning, especially deep learning, shows great potential in quantitative structure-activity relationship (QSAR) modeling to accelerate the drug discovery process and reduce its cost. A major challenge in developing robust and generalizable deep learning models for QSAR is the lack of large datasets with high-quality, balanced labels. To address this challenge, we developed a self-training method, Partially LAbeled Noisy Student (PLANS), and a novel self-supervised graph embedding, Graph-Isomorphism-Network Fingerprint (GINFP), which uses unlabeled data to learn chemical compound representations that carry substructure information. These representations can be used to predict chemical properties such as binding affinity and toxicity. PLANS-GINFP allows us to exploit millions of unlabeled chemical compounds, as well as labeled and partially labeled pharmacological data, to improve the generalizability of neural network models.
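The abstract does not describe the GINFP architecture or its self-supervised objective, so the following is only a minimal sketch of the generic Graph Isomorphism Network update that the name refers to: each atom's features are combined with the sum of its neighbors' features and passed through a small MLP, and a graph-level vector is obtained by sum pooling. The function names, layer sizes, random weights, and the toy three-atom molecule are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def gin_layer(h, adj, w1, w2, eps=0.0):
    """One generic GIN update: MLP((1 + eps) * h_v + sum of neighbor features)."""
    agg = (1.0 + eps) * h + adj @ h       # self term plus neighbor sum
    hidden = np.maximum(agg @ w1, 0.0)    # first MLP layer with ReLU
    return np.maximum(hidden @ w2, 0.0)   # second MLP layer with ReLU

def graph_embedding(h, adj, layers):
    """Sum-pool node features after every layer and concatenate the results."""
    pooled = [h.sum(axis=0)]
    for w1, w2 in layers:
        h = gin_layer(h, adj, w1, w2)
        pooled.append(h.sum(axis=0))
    return np.concatenate(pooled)

# Hypothetical toy molecule: a 3-atom chain with one-hot atom types [C, O].
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
atoms = np.array([[1, 0],
                  [1, 0],
                  [0, 1]], dtype=float)

# Untrained random weights; a real model would learn these from data.
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(2, 4)), rng.normal(size=(4, 4))),
    (rng.normal(size=(4, 4)), rng.normal(size=(4, 4))),
]

fp = graph_embedding(atoms, adj, layers)
print(fp.shape)
```

Because both the neighbor aggregation and the readout are sums, the resulting vector does not depend on the order in which atoms are listed, which is the property that makes this kind of embedding usable as a fingerprint.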

RESULTS

We evaluated PLANS-GINFP on predicting Cytochrome P450 (CYP450) binding activity in a CYP450 dataset and chemical toxicity in the Tox21 dataset. Extensive benchmark studies demonstrated that PLANS-GINFP improved performance by a large margin in both cases. Both PLANS-based self-training and GINFP-based self-supervised learning contributed to the improvement.

CONCLUSION

To better exploit chemical structures as input for machine learning algorithms, we proposed a self-supervised, graph neural network-based embedding method that encodes substructure information. Furthermore, we developed a model-agnostic self-training method, PLANS, that can be applied to any deep learning architecture to improve prediction accuracy. PLANS provides a way to better utilize partially labeled and unlabeled data. Comprehensive benchmark studies demonstrated the potential of both methods for predicting drug metabolism and toxicity profiles from sparse, noisy, and imbalanced data. PLANS-GINFP could serve as a general solution for improving predictive QSAR modeling.
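The abstract does not spell out the PLANS algorithm, so the sketch below only illustrates the general pattern it names: a teacher is trained on the labels that are observed, its predictions fill in the missing labels as soft pseudo-labels, and a student is then trained on the filled label matrix with noise applied. Everything here is an assumption for illustration: a multi-task logistic regression stands in for a neural network, Gaussian input noise stands in for whatever noise the authors use, missing labels are encoded as NaN, and all function names and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def masked_bce(probs, labels):
    """Mean binary cross-entropy over observed entries only (NaN = missing)."""
    mask = ~np.isnan(labels)
    y = np.where(mask, labels, 0.0)
    p = np.clip(probs, 1e-7, 1 - 1e-7)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return (loss * mask).sum() / max(mask.sum(), 1)

def fit_linear(x, labels, noise_std=0.0, steps=500, lr=0.5, seed=0):
    """Multi-task logistic regression trained on observed labels only.

    noise_std > 0 adds Gaussian input noise at every step (the noised student).
    """
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(labels)
    y = np.where(mask, labels, 0.0)
    w = np.zeros((x.shape[1], labels.shape[1]))
    for _ in range(steps):
        x_in = x + rng.normal(scale=noise_std, size=x.shape)
        # Gradient of the masked cross-entropy; missing entries contribute zero.
        grad = x_in.T @ ((sigmoid(x_in @ w) - y) * mask) / max(mask.sum(), 1)
        w -= lr * grad
    return w

def self_train(x, labels, rounds=2, noise_std=0.1):
    """Teacher fills missing labels with soft pseudo-labels; a noised student refits."""
    w = fit_linear(x, labels)                        # teacher: observed labels only
    for _ in range(rounds):
        pseudo = sigmoid(x @ w)                      # soft pseudo-labels
        filled = np.where(np.isnan(labels), pseudo, labels)
        w = fit_linear(x, filled, noise_std=noise_std)  # student becomes next teacher
    return w

# Synthetic two-task data with half of the labels hidden.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 5))
true_w = rng.normal(size=(5, 2))
labels = (x @ true_w > 0).astype(float)
labels[rng.random(labels.shape) < 0.5] = np.nan

w = self_train(x, labels)
print(masked_bce(sigmoid(x @ w), labels))
```

The masking step is what lets a partially labeled compound contribute to the tasks for which it has a measurement without being penalized on the tasks for which it has none.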

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a0d3/9063120/de3512f69c0a/12859_2022_4681_Fig1_HTML.jpg

Similar articles

Deep semi-supervised learning via dynamic anchor graph embedding in latent space.

Neural Netw. 2022 Feb;146:350-360. doi: 10.1016/j.neunet.2021.11.026. Epub 2021 Dec 1.

An effective self-supervised framework for learning expressive molecular global representations to drug discovery.

Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab109.

Artificial intelligence to deep learning: machine intelligence approach for drug discovery.

Mol Divers. 2021 Aug;25(3):1315-1360. doi: 10.1007/s11030-021-10217-3. Epub 2021 Apr 12.

A unified deep semi-supervised graph learning scheme based on nodes re-weighting and manifold regularization.

Neural Netw. 2023 Jan;158:188-196. doi: 10.1016/j.neunet.2022.11.017. Epub 2022 Nov 19.

Efficient Combination of CNN and Transformer for Dual-Teacher Uncertainty-guided Semi-supervised Medical Image Segmentation.

Comput Methods Programs Biomed. 2022 Nov;226:107099. doi: 10.1016/j.cmpb.2022.107099. Epub 2022 Sep 2.

Graph-Based Self-Training for Semi-Supervised Deep Similarity Learning.

Sensors (Basel). 2023 Apr 13;23(8):3944. doi: 10.3390/s23083944.

Deep virtual adversarial self-training with consistency regularization for semi-supervised medical image classification.

Med Image Anal. 2021 May;70:102010. doi: 10.1016/j.media.2021.102010. Epub 2021 Feb 22.

Robust Semi-Supervised Traffic Sign Recognition via Self-Training and Weakly-Supervised Learning.

Sensors (Basel). 2020 May 8;20(9):2684. doi: 10.3390/s20092684.

Self-Supervised Feature Learning and Phenotyping for Assessing Age-Related Macular Degeneration Using Retinal Fundus Images.

Ophthalmol Retina. 2022 Feb;6(2):116-129. doi: 10.1016/j.oret.2021.06.010. Epub 2021 Jul 2.

Cited by

E-GuARD: expert-guided augmentation for the robust detection of compounds interfering with biological assays.

J Cheminform. 2025 Apr 29;17(1):64. doi: 10.1186/s13321-025-01014-3.

Towards automatic farrowing monitoring-A Noisy Student approach for improving detection performance of newborn piglets.

PLoS One. 2024 Oct 2;19(10):e0310818. doi: 10.1371/journal.pone.0310818. eCollection 2024.

Semi-supervised meta-learning elucidates understudied molecular interactions.

Commun Biol. 2024 Sep 9;7(1):1104. doi: 10.1038/s42003-024-06797-z.

Hierarchical multi-omics data integration and modeling predict cell-specific chemical proteomics and drug responses.

Cell Rep Methods. 2023 Apr 17;3(4):100452. doi: 10.1016/j.crmeth.2023.100452. eCollection 2023 Apr 24.

End-to-end sequence-structure-function meta-learning predicts genome-wide chemical-protein interactions for dark proteins.

PLoS Comput Biol. 2023 Jan 18;19(1):e1010851. doi: 10.1371/journal.pcbi.1010851. eCollection 2023 Jan.
