Center for Computational Toxicology and Exposure, Office of Research and Development, United States Environmental Protection Agency, Research Triangle Park, North Carolina 27711, United States.
ORAU Student Services Contractor to Center for Computational Toxicology and Exposure, Office of Research and Development, United States Environmental Protection Agency, Research Triangle Park, North Carolina 27711, United States.
Chem Res Toxicol. 2023 Mar 20;36(3):465-478. doi: 10.1021/acs.chemrestox.2c00379. Epub 2023 Mar 6.
The need for careful assembly, training, and validation of quantitative structure-activity/property relationship (QSAR/QSPR) models is more significant than ever as data sets become larger and sophisticated machine learning tools become increasingly ubiquitous and accessible to the scientific community. Regulatory agencies such as the United States Environmental Protection Agency must carefully scrutinize each aspect of a resulting QSAR/QSPR model to determine its potential use in environmental exposure and hazard assessment. Herein, we revisit the goals of the Organisation for Economic Co-operation and Development (OECD) and discuss its validation principles for structure-activity models in the context of our application. We apply these principles to a model for predicting water solubility of organic compounds derived using random forest regression, a common machine learning approach in the QSAR/QSPR literature. Using public sources, we carefully assembled and curated a data set consisting of 10,200 unique chemical structures with associated water solubility measurements. This data set was then used as a focal narrative to methodically consider the OECD's QSAR/QSPR principles and how they can be applied to random forests. Despite some expert, mechanistically informed supervision of descriptor selection to enhance model interpretability, we achieved a model of water solubility with comparable performance to previously published models (5-fold cross-validated performance: 0.81 R² and 0.98 RMSE). We hope this work will catalyze a necessary conversation around the importance of cautiously modernizing and explicitly leveraging OECD principles while pursuing state-of-the-art machine learning approaches to derive QSAR/QSPR models suitable for regulatory consideration.
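The abstract reports 5-fold cross-validated performance for a random forest regression model. As a rough illustration of how such figures are typically obtained, the following is a minimal sketch using scikit-learn. It is not the authors' pipeline: the data are synthetic stand-ins, and the descriptor count, sample size, hyperparameters, and random seeds are all assumptions chosen only to keep the example small and reproducible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data (NOT the paper's data set):
# 300 hypothetical "compounds" described by 5 hypothetical descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Assumed hyperparameters; the abstract does not state the ones actually used.
model = RandomForestRegressor(n_estimators=100, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions: each sample is predicted by a model
# fitted on the four folds that do not contain it.
y_pred = cross_val_predict(model, X, y, cv=cv)

# Metrics computed on the pooled out-of-fold predictions.
r2 = r2_score(y, y_pred)
rmse = float(np.sqrt(mean_squared_error(y, y_pred)))
print(f"5-fold CV R2 = {r2:.2f}, RMSE = {rmse:.2f}")
```

This sketch pools the out-of-fold predictions before scoring; averaging per-fold scores is an equally common convention and can give slightly different numbers, so a report should state which one was used.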