Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data.

Affiliations

In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité Universitätsmedizin Berlin, Berlin, 10117, Germany.

BASF SE, 67056, Ludwigshafen, Germany.

Publication information

Sci Rep. 2022 May 4;12(1):7244. doi: 10.1038/s41598-022-09309-3.

DOI:10.1038/s41598-022-09309-3
PMID:35508546
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9068909/
Abstract

Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate under the assumption that test and calibration set are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly-available data and applied to proprietary data for the liver toxicity and MNT in vivo endpoints were investigated. In most cases, a drastic decrease in the validity of the models was observed when applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. The restored validity is the first requisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven to be a useful approach to restore the validity of most models.
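
The recalibration idea in the abstract can be illustrated with a small, purely synthetic sketch. It is not the authors' pipeline or data: the one-dimensional "descriptor", the fixed logistic stand-in model, the Gaussian class distributions, the shift values and every function name below are assumptions made for illustration. The sketch builds a class-wise (Mondrian) inductive conformal predictor on top of a model that is never retrained, then compares validity (share of prediction sets containing the true label) and efficiency (share of single-label sets) on a drifted holdout set, first with the original calibration set and then with a calibration set drawn from the drifted distribution.

```python
import math
import random


def model_prob_active(x):
    """Fixed stand-in for an already-trained model: P(active | x)."""
    return 1.0 / (1.0 + math.exp(-x))


def nonconformity(x, label):
    """One minus the model's probability for the candidate label."""
    p1 = model_prob_active(x)
    return 1.0 - (p1 if label == 1 else 1.0 - p1)


def calibrate(samples):
    """Collect per-class (Mondrian) nonconformity scores from labelled data."""
    scores = {0: [], 1: []}
    for x, y in samples:
        scores[y].append(nonconformity(x, y))
    return scores


def p_value(cal_scores, x, label):
    """Share of calibration scores at least as nonconforming as the test score."""
    alpha = nonconformity(x, label)
    n_ge = sum(1 for s in cal_scores[label] if s >= alpha)
    return (n_ge + 1) / (len(cal_scores[label]) + 1)


def prediction_set(cal_scores, x, eps):
    """All labels whose p-value exceeds the significance level eps."""
    return {lab for lab in (0, 1) if p_value(cal_scores, x, lab) > eps}


def evaluate(cal_scores, test, eps):
    """Return (validity, efficiency) on a labelled test set."""
    sets = [prediction_set(cal_scores, x, eps) for x, _ in test]
    validity = sum(y in s for s, (_, y) in zip(sets, test)) / len(test)
    efficiency = sum(len(s) == 1 for s in sets) / len(test)
    return validity, efficiency


def sample(rng, n, shift):
    """Synthetic data: class 1 ~ N(+shift, 1), class 0 ~ N(-shift, 1)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(shift if y == 1 else -shift, 1.0), y))
    return data


rng = random.Random(0)
eps = 0.2                            # target error rate, i.e. expected validity 0.8
old_cal = sample(rng, 1000, 0.8)     # calibration data from the original distribution
new_cal = sample(rng, 1000, 0.2)     # calibration data resembling the holdout set
holdout = sample(rng, 2000, 0.2)     # drifted data: classes overlap more

validity_before, efficiency_before = evaluate(calibrate(old_cal), holdout, eps)
validity_after, efficiency_after = evaluate(calibrate(new_cal), holdout, eps)

print(f"old calibration: validity={validity_before:.2f}, efficiency={efficiency_before:.2f}")
print(f"new calibration: validity={validity_after:.2f}, efficiency={efficiency_after:.2f}")
```

In this toy setting only the calibration scores change between the two evaluations; the model itself is untouched. The updated calibration set brings validity back toward the 1 - eps target, while more holdout samples receive a two-label (inconclusive) prediction set, which is the validity-for-efficiency trade-off the abstract describes.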

