Interpretable molecular encodings and representations for machine learning tasks.

Authors

Weckbecker Moritz, Anžel Aleksandar, Yang Zewen, Hattab Georges

Affiliations

Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany.

Department of Mathematics and Computer Science, Freie Universität, Arnimallee 14, 14195 Berlin, Germany.

Publication information

Comput Struct Biotechnol J. 2024 May 24;23:2326-2336. doi: 10.1016/j.csbj.2024.05.035. eCollection 2024 Dec.

DOI: 10.1016/j.csbj.2024.05.035

PMID: 38867722

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11167246/

Abstract

Molecular encodings and their usage in machine learning models have demonstrated significant breakthroughs in biomedical applications, particularly in the classification of peptides and proteins. To this end, we propose a new encoding method: Interpretable Carbon-based Array of Neighborhoods (iCAN). Designed to address machine learning models' need for more structured and less flexible input, it captures the neighborhoods of carbon atoms in a counting array and improves the utility of the resulting encodings for machine learning models. The iCAN method provides interpretable molecular encodings and representations, enabling the comparison of molecular neighborhoods, identification of repeating patterns, and visualization of relevance heat maps for a given data set. When reproducing a large biomedical peptide classification study, it outperforms its predecessor encoding. When extended to proteins, it outperforms a lead structure-based encoding on 71% of the data sets. Our method offers interpretable encodings that can be applied to all organic molecules, including exotic amino acids, cyclic peptides, and larger proteins, making it highly versatile across various domains and data sets. This work establishes a promising new direction for machine learning in peptide and protein classification in biomedicine and healthcare, potentially accelerating advances in drug discovery and disease diagnosis.
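The abstract describes iCAN only at a high level: the neighborhoods of carbon atoms are captured in a counting array. The following Python sketch illustrates that general idea and is not the authors' implementation. The element alphabet `ELEMENTS`, the helper names `carbon_neighborhood_counts` and `to_fixed_vector`, the restriction to directly bonded atoms, and the signature layout are all assumptions made for illustration; the actual array layout used by iCAN is not specified in this abstract.

```python
from collections import Counter

# Elements tracked in each neighborhood. This alphabet is an assumption
# made for illustration; it is not taken from the iCAN paper.
ELEMENTS = ("C", "H", "N", "O", "S")


def carbon_neighborhood_counts(atoms, bonds):
    """Count carbon atoms by the elements directly bonded to them.

    atoms: list of element symbols, indexed by atom id.
    bonds: list of (i, j) atom-index pairs, one per bond.
    Returns a Counter mapping a neighborhood signature (per-element
    neighbor counts in ELEMENTS order) to the number of carbon atoms
    that have exactly that neighborhood.
    """
    neighbors = [[] for _ in atoms]
    for i, j in bonds:
        neighbors[i].append(atoms[j])
        neighbors[j].append(atoms[i])

    counts = Counter()
    for idx, element in enumerate(atoms):
        if element != "C":
            continue  # only carbon atoms act as neighborhood centers
        local = Counter(neighbors[idx])
        signature = tuple(local.get(e, 0) for e in ELEMENTS)
        counts[signature] += 1
    return counts


def to_fixed_vector(counts, vocabulary):
    """Project the counts onto a fixed, ordered list of signatures so that
    every molecule yields a vector of the same length."""
    return [counts.get(signature, 0) for signature in vocabulary]


# Glycine (NH2-CH2-COOH) with explicit hydrogens.
atoms = ["N", "C", "C", "O", "O", "H", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (2, 3), (2, 4), (0, 5), (0, 6), (1, 7), (1, 8), (4, 9)]

glycine_counts = carbon_neighborhood_counts(atoms, bonds)

# Hypothetical vocabulary: alpha carbon, carboxyl carbon, methane-like carbon.
vocab = [(1, 2, 1, 0, 0), (1, 0, 0, 2, 0), (0, 4, 0, 0, 0)]
glycine_vector = to_fixed_vector(glycine_counts, vocab)
```

Because each vector position corresponds to a named neighborhood, a feature importance assigned by a downstream model can be traced back to a concrete chemical environment, which is the kind of interpretability the abstract refers to.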


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/33de/11167246/3099519f83c9/gr001.jpg

