Duke University, Department of Civil and Environmental Engineering, 121 Hudson Hall, Durham, NC 27708, USA; Center for the Environmental Implications of NanoTechnology (CEINT), USA.
Duke University, Department of Civil and Environmental Engineering, 121 Hudson Hall, Durham, NC 27708, USA; Duke University, Department of Biostatistics and Bioinformatics, Duke University Medical Center, 2424 Erwin Road, Suite 1102 Hock Plaza, Durham, NC 27710, USA.
Sci Total Environ. 2022 Aug 15;834:154849. doi: 10.1016/j.scitotenv.2022.154849. Epub 2022 Apr 8.
Chemical ingredients in consumer products are continually changing. To understand our exposure to chemicals and their consequent risk, we need to know their concentrations in products, or chemical weight fractions. Unfortunately, manufacturers rarely report comprehensive weight fraction data on product labels. The goal of this study was to evaluate the utility of machine learning strategies for predicting weight fractions when chemical constituent data are limited. A "data-poor" framework was developed and tested using a small dataset on consumer products containing engineered nanomaterials to represent emerging substances. A second, more traditional framework was applied to a "data-rich" product dataset composed of bulk-scale organic chemicals for comparison purposes. Feature variables included chemical properties, functional use categories (e.g., antimicrobial), product categories (e.g., makeup), product matrix categories, and whether weight fractions were manufacturer-reported or experimentally obtained. Observations were classified into three weight fraction bins using a random forest or nonlinear support vector classifier. An ablation study revealed that functional use data improved predictive performance when included alongside chemical property data, suggesting the utility of functional use categories in evaluating the safety and sustainability of emerging chemicals. Models could roughly stratify material-product observations into order-of-magnitude weight fraction bins with moderate success; the best of these achieved an average balanced accuracy of 73% on the nanomaterials product data. Framework comparisons also revealed a positive trend between sample size and average balanced accuracy, suggesting great promise for machine learning approaches with continued investment in chemical data collection.
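The binning and scoring steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the bin edges (0.01 and 0.10 on a 0-1 scale), the bin names, the helper names `to_bin` and `balanced_accuracy`, and the example values are all assumptions chosen for illustration, since the abstract does not state them. The sketch only shows why balanced accuracy (the mean of per-bin recall) is a stricter score than plain accuracy when one bin dominates the data.

```python
from collections import defaultdict

# Hypothetical bin edges on a 0-1 weight fraction scale; the abstract does not give them.
BIN_EDGES = (0.01, 0.10)
BIN_LABELS = ("low", "medium", "high")


def to_bin(weight_fraction, edges=BIN_EDGES):
    """Map a weight fraction in [0, 1] to one of three ordinal bins."""
    if not 0.0 <= weight_fraction <= 1.0:
        raise ValueError("weight fraction must lie in [0, 1]")
    low_edge, high_edge = edges
    if weight_fraction < low_edge:
        return BIN_LABELS[0]
    if weight_fraction < high_edge:
        return BIN_LABELS[1]
    return BIN_LABELS[2]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every bin counts equally regardless of its size."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Made-up observations: four "low", one "medium", one "high".
true_fractions = [0.001, 0.005, 0.003, 0.002, 0.05, 0.5]
y_true = [to_bin(w) for w in true_fractions]

# A made-up classifier output that never predicts the "medium" bin.
y_pred = ["low", "low", "low", "low", "low", "high"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)

print(f"plain accuracy:    {plain_accuracy:.3f}")
print(f"balanced accuracy: {score:.3f}")
```

In this toy case the classifier misses the only "medium" observation, so its recall is 1, 0, and 1 across the three bins. Plain accuracy is 5/6, while balanced accuracy is 2/3, which is why the latter is the more informative score for small, imbalanced product datasets.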