MASSA Algorithm: an automated rational sampling of training and test subsets for QSAR modeling.

Authors

Veríssimo Gabriel Corrêa, Pantaleão Simone Queiroz, Fernandes Philipe de Olveira, Gertrudes Jadson Castro, Kronenberger Thales, Honorio Kathia Maria, Maltarollo Vinícius Gonçalves

Affiliations

Department of Pharmaceutical Products, Faculty of Pharmacy, Federal University of Minas Gerais, Belo Horizonte, MG, 31270-901, Brazil.

Federal University of ABC, Santo André, SP, 09210-170, Brazil.

Publication information

J Comput Aided Mol Des. 2023 Dec;37(12):735-754. doi: 10.1007/s10822-023-00536-y. Epub 2023 Oct 7.

DOI: 10.1007/s10822-023-00536-y

PMID: 37804393

Abstract

QSAR models capable of predicting biological, toxicity, and pharmacokinetic properties are widely used to search for lead bioactive molecules in chemical databases. Dataset preparation strongly influences the quality of the resulting models, and sampling requires that the original dataset be divided into a training set (for model training) and a test set (for statistical evaluation). This sampling can be done randomly or rationally, but the rational division is superior. In this paper, we present MASSA, a Python tool that automatically samples datasets by exploring the biological, physicochemical, and structural spaces of molecules using principal component analysis (PCA), hierarchical cluster analysis (HCA), and K-modes. The proposed algorithm is particularly useful when the variables used for QSAR are not available, or when multiple QSAR models must be built with the same training and test sets; it produces models with lower variability and better values for validation metrics. These results were obtained even when the descriptors used in the QSAR/QSPR modeling differed from those used to separate the training and test sets, indicating that the tool can be used to build models for more than one QSAR/QSPR technique. Finally, the tool also generates graphical representations that can provide insights into the data.
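The idea of a rational split can be illustrated with a minimal sketch. This is not MASSA's actual implementation or API. It assumes that each molecule has already been assigned a cluster label by some earlier step (for example, HCA on PCA scores, or K-modes on structural fingerprints), and it only shows the final stage: drawing test samples from every cluster so that both subsets cover the same regions of chemical space. The function name `cluster_stratified_split` and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def cluster_stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test so each cluster appears in both.

    labels: one cluster label per molecule (assumed to come from a prior
            clustering step; this function does not compute clusters).
    """
    rng = random.Random(seed)

    # Group molecule indices by their cluster label.
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)

    train, test = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        if len(members) < 2:
            # A singleton cluster cannot be shared; keep it for training.
            n_test = 0
        else:
            # At least one test sample, but always leave one for training.
            n_test = max(1, round(len(members) * test_fraction))
            n_test = min(n_test, len(members) - 1)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Hypothetical cluster assignments for 12 molecules in three clusters.
labels = ["A"] * 6 + ["B"] * 4 + ["C"] * 2
train_idx, test_idx = cluster_stratified_split(labels, test_fraction=0.25)
print("train:", train_idx)
print("test:", test_idx)
```

A purely random split of the same 12 molecules could leave the small cluster "C" entirely out of the test set; sampling within each cluster avoids that, which is one plausible reason a rational division gives more stable validation metrics.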

