Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study.

Affiliation

Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong, China.

Publication information

BMC Bioinformatics. 2014 Aug 27;15(1):291. doi: 10.1186/1471-2105-15-291.

DOI:10.1186/1471-2105-15-291

PMID:25159129

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4153907/

Abstract

BACKGROUND

State-of-the-art protein-ligand docking methods are generally limited by the traditionally low accuracy of their scoring functions, which are used to predict binding affinity and are thus vital for discriminating between active and inactive compounds. Despite intensive research over the years, classical scoring functions have reached a plateau in their predictive performance. These functions assume a predetermined additive functional form for some sophisticated numerical features, and use standard multivariate linear regression (MLR) on experimental data to derive the coefficients.
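As a rough illustration of the additive form described above, a classical scoring function can be written as a weighted sum of structural features whose weights are fitted by least squares. The notation here is an assumption made for this sketch and is not taken from the paper: $x_i$ are the $m$ numerical features of a complex, $y_n$ is the measured binding affinity of the $n$-th of $N$ training complexes, and $w_i$ are the regression coefficients.

```latex
\hat{y}_{\mathrm{MLR}}(\mathbf{x}) = w_0 + \sum_{i=1}^{m} w_i x_i,
\qquad
(w_0, \dots, w_m) = \arg\min_{w_0, \dots, w_m} \sum_{n=1}^{N}
\Bigl( y_n - w_0 - \sum_{i=1}^{m} w_i x_{n,i} \Bigr)^{2}
```

Under this form each feature contributes independently and linearly to the predicted affinity, so interactions between features cannot be represented no matter how much training data is available.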

RESULTS

In this study we show that such a simple functional form is detrimental to the prediction performance of a scoring function, and that replacing linear regression with machine learning techniques such as random forest (RF) can improve prediction performance. We investigate the conditions of applying RF under various contexts and find that, given sufficient training samples, RF manages to comprehensively capture the non-linearity between structural features and measured binding affinities. Incorporating more structural features and training with more samples can both boost RF performance. In addition, we analyze the importance of structural features to binding affinity prediction using the RF variable importance tool. Lastly, we use Cyscore, a top performing empirical scoring function, as a baseline for comparison.
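The contrast between the two regressors can be sketched on toy data. The following is a minimal, standard-library-only illustration and not the authors' code: the data are synthetic (two made-up features whose product determines the target, standing in for a non-additive relationship), the forest is a simplified variant (bootstrap samples, one randomly chosen feature per split), and every name such as `fit_mlr`, `build_tree` and `fit_forest` is hypothetical.

```python
import random
import statistics


def fit_mlr(X, y):
    """Ordinary least squares via the normal equations (intercept plus one weight per feature)."""
    rows = [[1.0] + list(x) for x in X]
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    for c in range(n):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    w = [M[i][n] / M[i][i] for i in range(n)]
    return lambda x: w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def build_tree(X, y, rng, depth=0, max_depth=6, min_leaf=5):
    """Grow a regression tree; each split considers one randomly chosen feature."""
    n = len(y)
    if depth == max_depth or n < 2 * min_leaf:
        return statistics.fmean(y)  # leaf: mean target of the samples that reached it
    j = rng.randrange(len(X[0]))
    order = sorted(range(n), key=lambda i: X[i][j])
    total = sum(y)
    left_sum, best_gain, best_k = 0.0, -1.0, None
    for k in range(1, n):
        left_sum += y[order[k - 1]]
        if k < min_leaf or n - k < min_leaf or X[order[k - 1]][j] == X[order[k]][j]:
            continue
        # Minimising the summed squared error of both children is
        # equivalent to maximising this quantity.
        gain = left_sum ** 2 / k + (total - left_sum) ** 2 / (n - k)
        if gain > best_gain:
            best_gain, best_k = gain, k
    if best_k is None:
        return statistics.fmean(y)
    threshold = (X[order[best_k - 1]][j] + X[order[best_k]][j]) / 2
    left, right = order[:best_k], order[best_k:]
    return (j, threshold,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       rng, depth + 1, max_depth, min_leaf),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       rng, depth + 1, max_depth, min_leaf))


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf value."""
    while not isinstance(node, float):
        j, threshold, left, right = node
        node = left if x[j] <= threshold else right
    return node


def fit_forest(X, y, n_trees=30, seed=0):
    """Average of trees, each grown on a bootstrap sample of the training set."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return lambda x: statistics.fmean(predict_tree(t, x) for t in trees)


def rmse(model, X, y):
    return (sum((model(x) - t) ** 2 for x, t in zip(X, y)) / len(y)) ** 0.5


def make_data(n, rng):
    """Synthetic, non-additive target: the product of two features plus noise."""
    X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    y = [a * b + rng.gauss(0, 0.05) for a, b in X]
    return X, y


data_rng = random.Random(42)
X_train, y_train = make_data(300, data_rng)
X_test, y_test = make_data(200, data_rng)

rmse_mlr = rmse(fit_mlr(X_train, y_train), X_test, y_test)
rmse_rf = rmse(fit_forest(X_train, y_train), X_test, y_test)
print(f"MLR test RMSE: {rmse_mlr:.3f}")
print(f"RF  test RMSE: {rmse_rf:.3f}")
```

On data of this kind the linear model has no term that can express the interaction, whereas the trees can approximate it by partitioning the feature space. Real scoring-function features and measured affinities would of course behave differently from this toy target.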

CONCLUSIONS

Machine-learning scoring functions are fundamentally different from classical scoring functions because the former circumvent the fixed functional form relating structural features to binding affinities. RF, but not MLR, can effectively exploit more structural features and more training samples, leading to higher prediction performance. The future availability of more X-ray crystal structures will further widen the performance gap between RF-based and MLR-based scoring functions. This further stresses the importance of substituting RF for MLR in scoring function development.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6674/4153907/a7e62a20f6df/12859_2014_6553_Fig1_HTML.jpg
