The topology of molecular representations and its influence on machine learning performance.

Authors

Rottach Florian, Schieferdecker Sebastian, Eickhoff Carsten

Affiliations

Central Data Science, Boehringer Ingelheim GmbH, Biberach/Riss, Germany.

School of Medicine, University of Tübingen, Tübingen, Germany.

Publication information

J Cheminform. 2025 Jul 21;17(1):109. doi: 10.1186/s13321-025-01045-w.

DOI:10.1186/s13321-025-01045-w

PMID:40691856

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12281805/

Abstract

Advancements in cheminformatics have led to numerous methods for encoding molecules numerically. The choice of molecular representation impacts the accuracy and generalizability of learning algorithms applied to chemical datasets. Designing and selecting the appropriate representation often lacks a systematic approach and follows computationally exhaustive empirical testing. Moreover, research has shown that deep learning models do not substantially outperform traditional approaches across many tasks with no clear explanation for this shortfall. In this work, we present TopoLearn, a model that predicts the effectiveness of representations on datasets based on the topological characteristics of the corresponding feature space. Using interpretability techniques, we find that persistent homology descriptors are linked with the error metrics of trained machine learning models, offering a new method to better understand and select molecular representations.

Scientific contribution

Our research is the first to establish an empirical connection between the topology of feature spaces and the machine learning performance of molecular representations. In addition, we facilitate future research endeavors by providing open access to our developed model.
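To make the idea of a persistent homology descriptor of a feature space more concrete, the following is a minimal sketch and not the TopoLearn implementation. It assumes Euclidean distance between feature vectors and covers only 0-dimensional homology (connected components) of a Vietoris-Rips filtration, where every component is born at scale 0 and dies at the length of the minimum-spanning-tree edge that merges it into another component. The function names `h0_persistence` and `summarize`, and the choice of summary statistics, are hypothetical; the abstract does not state which descriptors the authors use.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Death scales of connected components in a Vietoris-Rips filtration.

    All components are born at scale 0. The single component that never
    dies (infinite persistence) is omitted, so n points give n - 1 values.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, processed from shortest to longest
    # (Kruskal's algorithm for the minimum spanning tree).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge at this scale: one of them dies here.
            parent[root_i] = root_j
            deaths.append(dist)
    return deaths


def summarize(deaths):
    """Illustrative scalar descriptors of an H0 persistence diagram."""
    total = sum(deaths)
    if total == 0:
        return {"total_persistence": 0.0, "max_persistence": 0.0,
                "persistence_entropy": 0.0}
    probs = [d / total for d in deaths if d > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return {"total_persistence": total, "max_persistence": max(deaths),
            "persistence_entropy": entropy}


if __name__ == "__main__":
    # Toy "feature space": two well-separated clusters of 2D points.
    toy = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
    print(summarize(h0_persistence(toy)))
```

In this toy case one death scale is much larger than the others, reflecting the gap between the two clusters. Descriptors of this kind, computed per representation and dataset, are the sort of input a meta-model could relate to downstream error metrics. Higher-dimensional homology (loops, voids) would in practice require a dedicated library.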
