Free and open-source QSAR-ready workflow for automated standardization of chemical structures in support of QSAR modeling.

Author information

Mansouri Kamel, Moreira-Filho José T, Lowe Charles N, Charest Nathaniel, Martin Todd, Tkachenko Valery, Judson Richard, Conway Mike, Kleinstreuer Nicole C, Williams Antony J

Affiliations

National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA.

Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA.

Publication information

J Cheminform. 2024 Feb 20;16(1):19. doi: 10.1186/s13321-024-00814-3.

DOI:10.1186/s13321-024-00814-3

PMID:38378618

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10880251/

Abstract

The rapid increase of publicly available chemical structures and associated experimental data presents a valuable opportunity to build robust QSAR models for applications in different fields. However, a common concern is the quality of both the chemical structure information and the associated experimental data. This is especially true when those data are collected from multiple sources, as chemical substance mappings can contain many duplicate structures and molecular inconsistencies. Such issues can impact the resulting molecular descriptors and their mappings to experimental data and, subsequently, the quality of the derived models in terms of accuracy, repeatability, and reliability. Herein we describe the development of an automated workflow to standardize chemical structures according to a set of standard rules and generate two- and/or three-dimensional "QSAR-ready" forms prior to the calculation of molecular descriptors. The workflow was designed in the KNIME workflow environment and consists of three high-level steps. First, a structure encoding is read. Second, the resulting in-memory representation is cross-referenced with any existing identifiers for consistency. Finally, the structure is standardized using a series of operations including desalting, stripping of stereochemistry (for two-dimensional structures), standardization of tautomers and nitro groups, valence correction, neutralization when possible, and then removal of duplicates. This workflow was initially developed to support collaborative QSAR modeling projects to ensure consistency of the results from the different participants. It was then updated and generalized for other modeling applications. This included modification of the "QSAR-ready" workflow to generate "MS-ready" structures to support the generation of substance mappings and searches for software applications related to non-targeted analysis mass spectrometry. Both the QSAR-ready and MS-ready workflows are freely available in KNIME, via standalone versions on GitHub, and as Docker container resources for the scientific community.

Scientific contribution: This work pioneers an automated workflow in KNIME, systematically standardizing chemical structures to ensure their readiness for QSAR modeling and broader scientific applications. By addressing data quality concerns through desalting, stereochemistry stripping, and normalization, it optimizes the accuracy and reliability of molecular descriptors. The freely available resources in KNIME, GitHub, and Docker containers democratize access, benefiting collaborative research and advancing diverse modeling endeavors in chemistry and mass spectrometry.
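The standardization operations named in the abstract can be illustrated with a deliberately simplified sketch. This is not the authors' KNIME workflow. It works on SMILES strings with the Python standard library only, and it covers just three of the listed operations: desalting (approximated here as keeping the largest dot-separated fragment), stripping of stereochemistry, and removal of duplicates. Tautomer and nitro-group standardization, valence correction, and neutralization are omitted because they need a real cheminformatics toolkit. Duplicates are detected by exact string match, not by canonicalization, so two different SMILES spellings of the same structure would not be merged. All function names (`count_atoms`, `desalt`, `strip_stereo`, `qsar_ready`, `deduplicate`) are hypothetical.

```python
import re

# One atom token in a SMILES string: a bracket atom such as [Na+],
# a two-letter halogen, or a one-letter organic-subset atom (aromatic included).
# Assumption: this string-level count is only a rough proxy for fragment size.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def count_atoms(fragment: str) -> int:
    """Count atom tokens in one SMILES fragment."""
    return len(ATOM.findall(fragment))


def desalt(smiles: str) -> str:
    """Keep the dot-separated fragment with the most atom tokens."""
    return max(smiles.split("."), key=count_atoms)


def strip_stereo(smiles: str) -> str:
    """Remove tetrahedral (@, @@) and double-bond (/ and backslash) markers."""
    return re.sub(r"[@/\\]", "", smiles)


def qsar_ready(smiles: str) -> str:
    """Apply the two simplified standardization steps in sequence."""
    return strip_stereo(desalt(smiles))


def deduplicate(records):
    """Group record names by standardized string, keeping first-seen order."""
    groups = {}
    for name, smiles in records:
        groups.setdefault(qsar_ready(smiles), []).append(name)
    return groups


if __name__ == "__main__":
    records = [
        ("entry-1", "C[C@H](N)C(=O)O"),       # one enantiomer, free form
        ("entry-2", "C[C@@H](N)C(=O)O.Cl"),   # other enantiomer, with a chloride fragment
        ("entry-3", "CC(=O)[O-].[Na+]"),      # sodium salt; charge is NOT neutralized here
    ]
    for standardized, names in deduplicate(records).items():
        print(standardized, names)
```

In this sketch, `entry-1` and `entry-2` collapse into a single group once the counter-ion and the stereo markers are removed, which is the kind of duplicate the abstract says arises when data are merged from multiple sources. `entry-3` keeps its charged acetate form, showing where a neutralization step would still be needed.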

Fig. 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/42f7/10880251/564adc20bcbf/13321_2024_814_Fig1_HTML.jpg
