Coverage bias in small molecule machine learning.

Authors

Kretschmer Fleming, Seipp Jan, Ludwig Marcus, Klau Gunnar W, Böcker Sebastian

Affiliations

Chair for Bioinformatics, Institute for Computer Science, Friedrich Schiller University Jena, Jena, Germany.

Algorithmic Bioinformatics, Institute for Computer Science, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.

Publication information

Nat Commun. 2025 Jan 9;16(1):554. doi: 10.1038/s41467-024-55462-w.

DOI:10.1038/s41467-024-55462-w
PMID:39788952
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11718084/
Abstract

Small molecule machine learning aims to predict chemical, biochemical, or biological properties from molecular structures, with applications such as toxicity prediction, ligand binding, and pharmacokinetics. A recent trend is developing end-to-end models that avoid explicit domain knowledge. These models assume no coverage bias in training and evaluation data, meaning the data are representative of the true distribution. However, the domain of applicability is rarely considered in such models. Here, we investigate how well large-scale datasets cover the space of known biomolecular structures. For doing so, we propose a distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which aligns well with chemical similarity. Although this method is computationally hard, we introduce an efficient approach combining Integer Linear Programming and heuristic bounds. Our findings reveal that many widely-used datasets lack uniform coverage of biomolecular structures, limiting the predictive power of models trained on them. We propose two additional methods to assess whether training datasets diverge from known molecular distributions, potentially guiding future dataset creation to improve model performance.
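As an illustration of how a cheap bound can sidestep an exact MCES computation, here is a minimal sketch. It rests on assumptions that the abstract does not state: molecules are given as lists of labeled edges (hydrogens omitted), and the distance is taken to be d = |E1| + |E2| − 2|E_c|, where E_c is a maximum common edge subgraph. Since common edges must carry matching labels, comparing the edge-label multisets gives a lower bound on d. The function names and the edge encoding are hypothetical, and this is not necessarily one of the bounds used by the authors.

```python
from collections import Counter


def edge_label(atom_a, atom_b, order=1):
    """Canonical label of a bond: unordered pair of element symbols plus bond order."""
    return (tuple(sorted((atom_a, atom_b))), order)


def mces_distance_lower_bound(edges_1, edges_2):
    """Lower bound on d = |E1| + |E2| - 2|E_c| (assumed definition).

    A common edge subgraph can contain at most min(c1[l], c2[l]) edges of
    each label l, so d is at least the sum over labels of |c1[l] - c2[l]|.
    """
    c1 = Counter(edge_label(*e) for e in edges_1)
    c2 = Counter(edge_label(*e) for e in edges_2)
    # Counter returns 0 for labels that are missing from one molecule.
    return sum(abs(c1[label] - c2[label]) for label in c1.keys() | c2.keys())


# Heavy-atom bonds only; each edge is (element, element, bond order).
ethanol = [("C", "C", 1), ("C", "O", 1)]
dimethyl_ether = [("C", "O", 1), ("O", "C", 1)]

print(mces_distance_lower_bound(ethanol, dimethyl_ether))  # 2
```

If such a bound already exceeds a chosen distance threshold, the pair could be reported as "far apart" without running an exact solver; only pairs below the threshold would need the expensive computation.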


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/20e9/11718084/ceebb70e091a/41467_2024_55462_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/20e9/11718084/1b3d2559a824/41467_2024_55462_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/20e9/11718084/a1d0f22523c0/41467_2024_55462_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/20e9/11718084/67a4752b785c/41467_2024_55462_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/20e9/11718084/3b85f1d9c99a/41467_2024_55462_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/20e9/11718084/b8a9bec79c4b/41467_2024_55462_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/20e9/11718084/5ef089402f0e/41467_2024_55462_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/20e9/11718084/7c5f04872198/41467_2024_55462_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/20e9/11718084/c3ce53705313/41467_2024_55462_Fig9_HTML.jpg

Similar articles

1
Coverage bias in small molecule machine learning.
Nat Commun. 2025 Jan 9;16(1):554. doi: 10.1038/s41467-024-55462-w.
2
Ensemble machine learning model trained on a new synthesized dataset generalizes well for stress prediction using wearable devices.
J Biomed Inform. 2023 Dec;148:104556. doi: 10.1016/j.jbi.2023.104556. Epub 2023 Dec 2.
3
General Approach to Estimate Error Bars for Quantitative Structure-Activity Relationship Predictions of Molecular Activity.
J Chem Inf Model. 2018 Aug 27;58(8):1561-1575. doi: 10.1021/acs.jcim.8b00114. Epub 2018 Jul 17.
4
Beware of machine learning-based scoring functions-on the danger of developing black boxes.
J Chem Inf Model. 2014 Oct 27;54(10):2807-15. doi: 10.1021/ci500406k. Epub 2014 Sep 24.
5
Application of convex hull analysis for the evaluation of data heterogeneity between patient populations of different origin and implications of hospital bias in downstream machine-learning-based data processing: A comparison of 4 critical-care patient datasets.
Front Big Data. 2022 Oct 31;5:603429. doi: 10.3389/fdata.2022.603429. eCollection 2022.
6
Prediction method of pharmacokinetic parameters of small molecule drugs based on GCN network model.
J Mol Model. 2024 Jul 12;30(8):264. doi: 10.1007/s00894-024-06051-7.
7
Progress of machine learning in the application of small molecule druggability prediction.
Eur J Med Chem. 2025 Mar 5;285:117269. doi: 10.1016/j.ejmech.2025.117269. Epub 2025 Jan 10.
8
Incorporating Explicit Water Molecules and Ligand Conformation Stability in Machine-Learning Scoring Functions.
J Chem Inf Model. 2019 Nov 25;59(11):4540-4549. doi: 10.1021/acs.jcim.9b00645. Epub 2019 Oct 31.
9
Training based on ligand efficiency improves prediction of bioactivities of ligands and drug target proteins in a machine learning approach.
J Chem Inf Model. 2013 Oct 28;53(10):2525-37. doi: 10.1021/ci400240u. Epub 2013 Sep 24.
10
BASE: a web service for providing compound-protein binding affinity prediction datasets with reduced similarity bias.
BMC Bioinformatics. 2024 Oct 30;25(1):340. doi: 10.1186/s12859-024-05968-3.

Cited by

1
Sex-specific lipidomic signatures in aortic valve disease reflect differential fibro-calcific progression.
Nat Commun. 2025 Jun 3;16(1):5163. doi: 10.1038/s41467-025-60411-2.
2
Integrating AI/ML and multi-omics approaches to investigate the role of TNFRSF10A/TRAILR1 and its potential targets in pancreatic cancer.
Comput Biol Med. 2025 Jul;193:110432. doi: 10.1016/j.compbiomed.2025.110432. Epub 2025 May 26.
3
Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS.
Nat Biotechnol. 2025 May 23. doi: 10.1038/s41587-025-02663-3.
4
Machine Learning in Drug Development for Neurological Diseases: A Review of Blood Brain Barrier Permeability Prediction Models.
Mol Inform. 2025 Mar;44(3):e202400325. doi: 10.1002/minf.202400325.
