Coverage bias in small molecule machine learning.

Authors

Kretschmer Fleming, Seipp Jan, Ludwig Marcus, Klau Gunnar W, Böcker Sebastian

Affiliations

Chair for Bioinformatics, Institute for Computer Science, Friedrich Schiller University Jena, Jena, Germany.

Algorithmic Bioinformatics, Institute for Computer Science, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.

Publication information

Nat Commun. 2025 Jan 9;16(1):554. doi: 10.1038/s41467-024-55462-w.

Abstract

Small molecule machine learning aims to predict chemical, biochemical, or biological properties from molecular structures, with applications such as toxicity prediction, ligand binding, and pharmacokinetics. A recent trend is developing end-to-end models that avoid explicit domain knowledge. These models assume no coverage bias in training and evaluation data, meaning the data are representative of the true distribution. However, the domain of applicability is rarely considered in such models. Here, we investigate how well large-scale datasets cover the space of known biomolecular structures. To do so, we propose a distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which aligns well with chemical similarity. Although the MCES problem is computationally hard, we introduce an efficient approach combining Integer Linear Programming and heuristic bounds. Our findings reveal that many widely used datasets lack uniform coverage of biomolecular structures, limiting the predictive power of models trained on them. We propose two additional methods to assess whether training datasets diverge from known molecular distributions, potentially guiding future dataset creation to improve model performance.
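To make the idea of an MCES-based distance with a cheap bound concrete, here is a minimal sketch on toy molecular graphs. It is not the authors' implementation. It assumes the distance is the number of edges left outside the common subgraph, |E1| + |E2| - 2|MCES|, ignores bond orders, and swaps in brute-force enumeration for Integer Linear Programming, which is only feasible for a handful of atoms. The edge-type counting bound and all function names are hypothetical illustrations, not the bounds used in the paper.

```python
from collections import Counter
from itertools import permutations


def edge_type_counts(labels, edges):
    """Count edges by the unordered pair of their endpoint atom labels."""
    return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)


def mces_distance_lower_bound(labels1, edges1, labels2, edges2):
    """Cheap lower bound on the distance (illustrative, not the paper's bound).

    A common edge must have the same endpoint labels in both graphs, so the
    MCES cannot contain more edges of a given type than either graph has.
    """
    c1 = edge_type_counts(labels1, edges1)
    c2 = edge_type_counts(labels2, edges2)
    max_common = sum(min(c1[t], c2[t]) for t in c1)
    return len(edges1) + len(edges2) - 2 * max_common


def mces_size_bruteforce(labels1, edges1, labels2, edges2):
    """Exact MCES size by trying every injective atom mapping (tiny graphs only)."""
    # Map the graph with fewer atoms into the graph with more atoms.
    if len(labels1) > len(labels2):
        labels1, edges1, labels2, edges2 = labels2, edges2, labels1, edges1
    edge_set2 = {frozenset(e) for e in edges2}
    best = 0
    for image in permutations(range(len(labels2)), len(labels1)):
        # An edge is common if both endpoints keep their atom label
        # and the mapped pair is also an edge in the second graph.
        common = sum(
            1
            for u, v in edges1
            if labels1[u] == labels2[image[u]]
            and labels1[v] == labels2[image[v]]
            and frozenset((image[u], image[v])) in edge_set2
        )
        best = max(best, common)
    return best


def mces_distance(labels1, edges1, labels2, edges2):
    """Assumed distance: edges of both graphs that are not in the MCES."""
    common = mces_size_bruteforce(labels1, edges1, labels2, edges2)
    return len(edges1) + len(edges2) - 2 * common


def distance_with_pruning(labels1, edges1, labels2, edges2, threshold):
    """Skip the exact computation when the bound already reaches the threshold.

    Returns (value, is_exact). The threshold is a hypothetical parameter.
    """
    bound = mces_distance_lower_bound(labels1, edges1, labels2, edges2)
    if bound >= threshold:
        return bound, False
    return mces_distance(labels1, edges1, labels2, edges2), True


if __name__ == "__main__":
    # Heavy-atom graphs: ethanol (C-C-O) and propane (C-C-C).
    ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
    propane = (["C", "C", "C"], [(0, 1), (1, 2)])
    print(mces_distance_lower_bound(*ethanol, *propane))  # 2
    print(mces_distance(*ethanol, *propane))              # 2
```

In this toy case the bound equals the exact distance, so a pruning threshold of 2 or less would avoid the exact search entirely. For real molecules the exact step would need a proper solver, since the number of atom mappings grows factorially.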


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/20e9/11718084/ceebb70e091a/41467_2024_55462_Fig1_HTML.jpg
