Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio 44106, United States.
Environ Sci Technol. 2022 Sep 6;56(17):12755-12764. doi: 10.1021/acs.est.2c01764. Epub 2022 Aug 16.
Machine learning (ML) is viewed as a promising tool for predicting aerobic biodegradation, one of the most important pathways by which organic chemicals are eliminated from the environment. However, existing models are built on small datasets (<3200 records), make only binary classification predictions, evaluate only ready biodegradability, and do not incorporate experimental conditions (e.g., system setup and reaction time). This study addressed all of these limitations by first compiling a large database of 12,750 records covering both ready and inherent biodegradation under different conditions, and then developing regression and classification models using different chemical representations and ML algorithms. The best regression model (R² = 0.54 and root mean square error of 0.25) and the best classification model (prediction accuracy of 85.1%) achieved very good performance. Model interpretation indicated that the models correctly captured the effects of chemical substructures, following the order C═O > O═C-O > OH > CH > halogen > branching > N > 6-member ring. Accounting for chemical speciation based on pKa and α notations did not affect the regression model performance but significantly improved the classification model performance (the accuracy increased to 87.6%). The models also showed large applicability domains and provided reasonable predictions for more than 98% of over 850,000 environmentally relevant chemicals in the Distributed Structure-Searchable Toxicity database. These robust, trustworthy models were finally made widely accessible through two free online predictors with graphical user interfaces.
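The abstract does not say how the speciation terms were computed. As a rough illustration only, the sketch below assumes the simplest case: a monoprotic acid whose neutral and ionized fractions (the α values) follow from its acid dissociation constant (pKa) and the solution pH through the Henderson-Hasselbalch relation. The function name `monoprotic_acid_fractions` is hypothetical and is not taken from the paper or its online predictors.

```python
def monoprotic_acid_fractions(pH: float, pKa: float) -> tuple[float, float]:
    """Return (alpha_neutral, alpha_ionized) for a monoprotic acid HA <-> H+ + A-.

    Assumption: ideal dilute solution, single ionizable group.
    """
    # Henderson-Hasselbalch: [A-]/[HA] = 10^(pH - pKa)
    ratio = 10.0 ** (pH - pKa)
    alpha_neutral = 1.0 / (1.0 + ratio)    # fraction present as HA
    alpha_ionized = ratio / (1.0 + ratio)  # fraction present as A-
    return alpha_neutral, alpha_ionized


if __name__ == "__main__":
    # Illustrative input: an acid with pKa 4.76 in water at pH 7
    neutral, ionized = monoprotic_acid_fractions(pH=7.0, pKa=4.76)
    print(f"neutral fraction: {neutral:.4f}, ionized fraction: {ionized:.4f}")
```

Under this assumption, a chemical whose pKa lies well below the test pH is present almost entirely in its ionized form, which is the kind of information a speciation-aware descriptor could pass to a model.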