Weckbecker Moritz, Anžel Aleksandar, Yang Zewen, Hattab Georges
Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, Berlin, 13353, Berlin, Germany.
Department of Mathematics and Computer Science, Freie Universität, Arnimallee 14, Berlin, 14195, Berlin, Germany.
Comput Struct Biotechnol J. 2024 May 24;23:2326-2336. doi: 10.1016/j.csbj.2024.05.035. eCollection 2024 Dec.
Molecular encodings and their usage in machine learning models have demonstrated significant breakthroughs in biomedical applications, particularly in the classification of peptides and proteins. To this end, we propose a new encoding method: Interpretable Carbon-based Array of Neighborhoods (iCAN). Designed to address machine learning models' need for more structured and less flexible input, it captures the neighborhoods of carbon atoms in a counting array and improves the utility of the resulting encodings for machine learning models. The iCAN method provides interpretable molecular encodings and representations, enabling the comparison of molecular neighborhoods, identification of repeating patterns, and visualization of relevance heat maps for a given data set. When reproducing a large biomedical peptide classification study, it outperforms its predecessor encoding. When extended to proteins, it outperforms a lead structure-based encoding on 71% of the data sets. Our method offers interpretable encodings that can be applied to all organic molecules, including exotic amino acids, cyclic peptides, and larger proteins, making it highly versatile across various domains and data sets. This work establishes a promising new direction for machine learning in peptide and protein classification in biomedicine and healthcare, potentially accelerating advances in drug discovery and disease diagnosis.
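The core idea stated above, capturing the neighborhoods of carbon atoms in a counting array, can be illustrated with a minimal sketch. This is not the authors' iCAN implementation: the element alphabet `ELEMENTS`, the helper names `carbon_neighborhood_counts` and `encode`, the atom/bond list input format, and the choice of a one-bond neighborhood are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical element alphabet (an assumption, not taken from the paper).
# Neighbors whose element is not listed here are ignored by this sketch.
ELEMENTS = ("C", "N", "O", "S", "H")


def carbon_neighborhood_counts(atoms, bonds):
    """Count how often each carbon neighborhood occurs in one molecule.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of undirected (i, j) atom-index pairs
    A neighborhood is the tuple of per-element neighbor counts of a
    carbon atom, ordered as in ELEMENTS.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    counts = Counter()
    for i, symbol in enumerate(atoms):
        if symbol != "C":
            continue  # only carbon atoms act as neighborhood centers
        tally = Counter(atoms[j] for j in neighbors[i])
        counts[tuple(tally.get(e, 0) for e in ELEMENTS)] += 1
    return counts


def encode(atoms, bonds, vocabulary):
    """Map a molecule to a fixed-length counting array over `vocabulary`."""
    counts = carbon_neighborhood_counts(atoms, bonds)
    return [counts.get(key, 0) for key in vocabulary]


# Ethanol (CH3-CH2-OH) with explicit hydrogens.
atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]

# Fixed vocabulary of neighborhoods: CH3-C, C-CH2-O, C-CH2-C.
vocabulary = [(1, 0, 0, 0, 3), (1, 0, 1, 0, 2), (2, 0, 0, 0, 2)]
encoding = encode(atoms, bonds, vocabulary)
print(encoding)
```

Because every position in the array corresponds to one named neighborhood, each feature can be traced back to a concrete chemical pattern, which is the kind of interpretability the abstract describes. A fixed vocabulary shared across molecules gives every molecule an array of the same length, regardless of its size.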