Random forests for feature selection in QSPR Models - an application for predicting standard enthalpy of formation of hydrocarbons.

Affiliation

LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal.

Publication information

J Cheminform. 2013 Feb 11;5(1):9. doi: 10.1186/1758-2946-5-9.

DOI:10.1186/1758-2946-5-9
PMID:23399299
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3599435/
Abstract

BACKGROUND

One of the main topics in the development of quantitative structure-property relationship (QSPR) predictive models is the identification of the subset of variables that represent the structure of a molecule and that are predictors for a given property. There are several automated feature selection methods, ranging from backward, forward, or stepwise procedures to more elaborate methodologies such as evolutionary programming. The problem lies in selecting the minimum subset of descriptors that can predict a certain property with good performance, in a computationally efficient and more robust way, since the presence of irrelevant or redundant features can cause poor generalization capacity. In this paper, an alternative selection method that uses Random Forests to determine variable importance is proposed in the context of QSPR regression problems, with an application to a manually curated dataset for predicting standard enthalpy of formation. The subsequent predictive models are trained with support vector machines, introducing the variables sequentially from a list ranked by variable importance.
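The workflow described above (rank variables with a random forest, then train support vector machines on progressively longer prefixes of the ranked list) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, uses a synthetic dataset in place of the curated enthalpy data, relies on impurity-based importances (the paper's importance measure may differ), and the function names `rank_features` and `sequential_svm_search` as well as the SVR hyperparameters are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def rank_features(X, y, n_trees=200, seed=0):
    """Return feature indices sorted from most to least important."""
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    # Impurity-based importance; an assumption, not necessarily the paper's measure.
    return np.argsort(forest.feature_importances_)[::-1]


def sequential_svm_search(X, y, ranking, max_features, folds=10):
    """Add features in ranked order; return (best_k, cv_rmse_per_k)."""
    rmse_per_k = []
    for k in range(1, max_features + 1):
        cols = ranking[:k]  # the k top-ranked descriptors
        model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))
        mse = -cross_val_score(model, X[:, cols], y, cv=folds,
                               scoring="neg_mean_squared_error")
        # Square root of the mean squared error averaged over the folds.
        rmse_per_k.append(float(np.sqrt(mse.mean())))
    best_k = int(np.argmin(rmse_per_k)) + 1
    return best_k, rmse_per_k


# Synthetic stand-in for a descriptor matrix (not the paper's data).
X, y = make_regression(n_samples=120, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)
ranking = rank_features(X, y)
best_k, rmse_per_k = sequential_svm_search(X, y, ranking, max_features=10)
print(best_k, rmse_per_k[best_k - 1])
```

In this sketch the number of variables is chosen as the prefix length with the lowest cross-validated RMSE; a real study would also hold out an independent set for the final check.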

RESULTS

The model generalizes well even with a high dimensional dataset and in the presence of highly correlated variables. The feature selection step was shown to yield lower prediction errors with RMSE values 23% lower than without feature selection, albeit using only 6% of the total number of variables (89 from the original 1485). The proposed approach further compared favourably with other feature selection methods and dimension reduction of the feature space. The predictive model was selected using a 10-fold cross validation procedure and, after selection, it was validated with an independent set to assess its performance when applied to new data and the results were similar to the ones obtained for the training set, supporting the robustness of the proposed approach.
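The figures quoted above can be checked with a few lines of arithmetic. The sketch below computes the fraction of descriptors retained (89 of 1485, as stated in the abstract) and shows how a relative RMSE reduction would be calculated. The helper names `rmse` and `relative_reduction` are hypothetical, and the RMSE values 10.0 and 7.7 are placeholders chosen only to illustrate a 23% reduction; they are not results from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline, reduced):
    """Fractional drop from baseline to reduced, e.g. 0.23 for 23% lower."""
    return (baseline - reduced) / baseline


# Figures stated in the abstract: 89 descriptors kept out of 1485.
fraction_kept = 89 / 1485
print(f"{fraction_kept:.1%}")  # prints 6.0%

# Hypothetical RMSE values, chosen only to illustrate a 23% reduction.
print(f"{relative_reduction(10.0, 7.7):.0%}")  # prints 23%
```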

CONCLUSIONS

The proposed methodology seemingly improves the prediction performance of standard enthalpy of formation of hydrocarbons using a limited set of molecular descriptors, providing faster and more cost-effective calculation of descriptors by reducing their numbers, and providing a better understanding of the underlying relationship between the molecular structure represented by descriptors and the property of interest.
