Mansouri Kamel, Grulke Chris M, Judson Richard S, Williams Antony J
National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA.
Oak Ridge Institute for Science and Education, 1299 Bethel Valley Road, Oak Ridge, TN, 37830, USA.
J Cheminform. 2018 Mar 8;10(1):10. doi: 10.1186/s13321-018-0263-1.
The collection of chemical structure information and associated experimental data for quantitative structure-activity/property relationship (QSAR/QSPR) modeling is facilitated by an increasing number of public databases containing large amounts of useful data. However, the performance of QSAR models depends strongly on the quality of the data and on the modeling methodology used. This study aims to develop robust QSAR/QSPR models for chemical properties of environmental interest that can be used for regulatory purposes. It primarily uses data from the publicly available PHYSPROP database for a set of 13 common physicochemical and environmental fate properties. These datasets underwent extensive curation using an automated workflow to select only high-quality data, and the chemical structures were standardized prior to calculation of the molecular descriptors. The modeling procedure was developed based on the five Organization for Economic Cooperation and Development (OECD) principles for QSAR models. A weighted k-nearest neighbor approach was adopted, using the minimum number of descriptors required, calculated with the open-source software PaDEL. Genetic algorithms selected only the most pertinent and mechanistically interpretable descriptors (2-15, with an average of 11 descriptors). The sizes of the modeled datasets varied from 150 chemicals for biodegradability half-life to 14,050 chemicals for logP, with an average of 3222 chemicals across all endpoints. The optimal models were built on randomly selected training sets (75%) and validated using fivefold cross-validation (CV) and test sets (25%). The CV Q² of the models varied from 0.72 to 0.95, with an average of 0.86, and the test-set R² varied from 0.71 to 0.96, with an average of 0.82. Modeling and performance details are described in QSAR model reporting format and were validated by the European Commission's Joint Research Centre to be OECD compliant.
All models are freely available as an open-source, command-line application called OPEn structure-activity/property Relationship App (OPERA). OPERA models were applied to more than 750,000 chemicals to produce freely available predicted data on the U.S. Environmental Protection Agency's CompTox Chemistry Dashboard.
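The validation scheme described above (a weighted k-nearest neighbor model, a random 75%/25% training/test split, fivefold CV on the training set) can be illustrated with a minimal sketch. It is not the OPERA implementation: the inverse-distance weighting, the value k = 5, the two synthetic "descriptors", and all function names are assumptions made for illustration only, and no descriptor selection by genetic algorithm is attempted.

```python
import math
import random


def weighted_knn_predict(train_X, train_y, x, k=5):
    """Predict a value for x as the inverse-distance-weighted mean of its
    k nearest training neighbors (weighting scheme is an assumption)."""
    nearest = sorted((math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y))[:k]
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]  # small constant avoids division by zero
    return sum(w * yi for w, (_, yi) in zip(weights, nearest)) / sum(weights)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cross_validated_q2(X, y, k=5, folds=5):
    """Q2 from fold-wise held-out predictions over the training set."""
    preds = [0.0] * len(X)
    for f in range(folds):
        held_out = set(range(f, len(X), folds))
        fit_X = [X[i] for i in range(len(X)) if i not in held_out]
        fit_y = [y[i] for i in range(len(X)) if i not in held_out]
        for i in held_out:
            preds[i] = weighted_knn_predict(fit_X, fit_y, X[i], k)
    return r_squared(y, preds)


# Synthetic stand-in for a curated dataset: two descriptors, one property.
rng = random.Random(0)
X = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(400)]
y = [2.0 * a + b + rng.gauss(0, 0.05) for a, b in X]

# Random 75% / 25% split into training and test sets.
indices = list(range(len(X)))
rng.shuffle(indices)
cut = int(0.75 * len(indices))
train_X = [X[i] for i in indices[:cut]]
train_y = [y[i] for i in indices[:cut]]
test_X = [X[i] for i in indices[cut:]]
test_y = [y[i] for i in indices[cut:]]

q2 = cross_validated_q2(train_X, train_y)
test_pred = [weighted_knn_predict(train_X, train_y, x) for x in test_X]
r2_test = r_squared(test_y, test_pred)
print(f"fivefold CV Q2 = {q2:.3f}, test R2 = {r2_test:.3f}")
```

Reporting Q² from held-out CV predictions alongside R² on an untouched test set, as the abstract does, gives two separate checks on predictive ability: one internal to the training data and one external to it.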