Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty.

Authors

Mervin Lewis H, Trapotsi Maria-Anna, Afzal Avid M, Barrett Ian P, Bender Andreas, Engkvist Ola

Affiliations

Molecular AI, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK.

Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK.

Publication

J Cheminform. 2021 Aug 19;13(1):62. doi: 10.1186/s13321-021-00539-7.

DOI:10.1186/s13321-021-00539-7
PMID:34412708
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8375213/
Abstract

Measurements of protein-ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have these unavoidable errors influencing its performance, and they should ideally be factored into modelling and output predictions, for example through the actual standard deviation of experimental measurements (σ) or the associated comparability of activity values between the aggregated heterogeneous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. In order to improve upon the current state of the art, we herein present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied toward in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated by taking into account various scenarios of experimental standard deviations in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way by the original RF algorithm. For example, in cases when σ ranged between 0.4-0.6 log units and when ideal probability estimates were between 0.4-0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident.
Finally, PRF models trained with putative inactives performed worse than PRF models trained without them. This could be because putative inactives were not assigned an experimental pXC50 value and were therefore treated as inactives with a low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.
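
The "ideal probability estimates" mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as one plausible reading rather than the authors' confirmed implementation, that a measured activity on a log scale (a pXC-type value such as a pXC50) carries Gaussian error with standard deviation σ, so the probability that the true value lies above the activity threshold is a normal cumulative distribution function. The function name `prob_active` and the threshold of 6.0 are hypothetical choices made for illustration only.

```python
import math


def prob_active(pxc50: float, threshold: float, sigma: float) -> float:
    """Probability that the true activity is at or above the threshold.

    Assumption (not taken from the abstract): the measurement error is
    Gaussian with standard deviation `sigma`, in log units.
    """
    if sigma <= 0:
        # With no uncertainty this collapses to the usual hard 0/1 label.
        return 1.0 if pxc50 >= threshold else 0.0
    z = (pxc50 - threshold) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


# Hypothetical threshold and measurements, for illustration only.
threshold = 6.0
for value in (5.0, 5.9, 6.0, 6.1, 7.0):
    hard_label = int(value >= threshold)            # what a standard RF is trained on
    soft_label = prob_active(value, threshold, 0.5)  # uncertainty-aware label
    print(f"pXC50={value:.1f}  hard={hard_label}  soft={soft_label:.3f}")
```

Under these assumptions, measurements just below and just above the threshold receive soft labels close to 0.5 instead of opposite hard labels, which is the region where the abstract reports the largest benefit for PRF. Measurements far from the threshold keep labels close to 0 or 1.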


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f8c/8375213/88e8de8d8946/13321_2021_539_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f8c/8375213/0da8a4912200/13321_2021_539_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f8c/8375213/d83c3fe806e3/13321_2021_539_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f8c/8375213/610dcadd7e56/13321_2021_539_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f8c/8375213/509c2f0347a9/13321_2021_539_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f8c/8375213/b84a3eeaa3fe/13321_2021_539_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f8c/8375213/51851ee46970/13321_2021_539_Fig7_HTML.jpg


