Abrahamsson Dimitri, Park June-Soo, Singh Randolph R, Sirota Marina, Woodruff Tracey J
Program on Reproductive Health and the Environment, Department of Obstetrics and Gynecology, University of California, San Francisco, California 94158, United States.
Environmental Chemistry Laboratory, California Department of Toxic Substances Control, Berkeley, California 94710, United States.
J Chem Inf Model. 2020 Jun 22;60(6):2718-2727. doi: 10.1021/acs.jcim.9b01096. Epub 2020 May 20.
Non-targeted analysis provides a comprehensive approach to analyzing environmental and biological samples for nearly all chemicals present. One of the main shortcomings of current analytical methods and workflows is that they are unable to provide any quantitative information, which constitutes an important obstacle to understanding environmental fate and human exposure. Herein, we present an in silico quantification method using machine learning for chemicals analyzed using electrospray ionization (ESI). We considered three data sets from different instrumental setups: (i) capillary electrophoresis electrospray ionization-mass spectrometry (CE-MS) in positive ionization mode (ESI+), (ii) liquid chromatography quadrupole time-of-flight mass spectrometry (LC-QTOF/MS) in ESI+, and (iii) LC-QTOF/MS in negative ionization mode (ESI-). We developed and applied two different machine-learning algorithms, a random forest (RF) and an artificial neural network (ANN), to predict the relative response factors (RRFs) of different chemicals based on their physicochemical properties. Chemical concentrations can then be calculated by dividing the measured abundance of a chemical, as peak area or peak height, by its corresponding RRF. We evaluated our models and tested their predictive power using 5-fold cross-validation (CV) and randomization. Both the RF and the ANN models showed great promise in predicting RRFs. However, the accuracy of the predictions was dependent on the data set composition and the experimental setup. For the CE-MS ESI+ data set, the best model predicted measured RRFs with a mean absolute error (MAE) of 0.19 log units and a cross-validation coefficient of determination of 0.84 for the testing set. For the LC-QTOF/MS ESI+ data set, the best model predicted measured RRFs with an MAE of 0.32 and a cross-validation coefficient of determination of 0.40. For the LC-QTOF/MS ESI- data set, the best model predicted measured RRFs with an MAE of 0.50 and a cross-validation coefficient of determination of 0.20.
Our findings suggest that machine-learning algorithms can be used for predicting concentrations of non-targeted chemicals with reasonable uncertainties, especially in ESI+, while the application to ESI- remains a more challenging problem.
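The quantification step described in the abstract (concentration = measured abundance / RRF) can be illustrated with a minimal sketch. It assumes that the "log units" are base-10 logarithms and that a model has already produced a predicted log10 RRF; the function names, the peak area, and the predicted RRF value are hypothetical and are not taken from the paper. The sketch also converts each reported MAE into a multiplicative (fold) error to show roughly how it would propagate to a concentration estimate.

```python
def concentration_from_abundance(abundance, log10_rrf):
    """Estimate concentration as abundance / RRF, with the RRF given on a log10 scale."""
    rrf = 10.0 ** log10_rrf
    return abundance / rrf


def fold_error(log10_error):
    """Convert an error expressed in log10 units into a multiplicative (fold) factor."""
    return 10.0 ** log10_error


# Hypothetical inputs: a measured peak area and a model-predicted log10(RRF).
peak_area = 2.0e6
predicted_log10_rrf = 5.0
estimate = concentration_from_abundance(peak_area, predicted_log10_rrf)

# MAE values reported in the abstract for the best model on each data set.
reported_mae = [
    ("CE-MS ESI+", 0.19),
    ("LC-QTOF/MS ESI+", 0.32),
    ("LC-QTOF/MS ESI-", 0.50),
]

for label, mae in reported_mae:
    factor = fold_error(mae)
    low, high = estimate / factor, estimate * factor
    print(f"{label}: typical error x{factor:.2f}, "
          f"estimate {estimate:.1f} (about {low:.1f} to {high:.1f})")
```

Under the base-10 assumption, an MAE of 0.19 log units corresponds to a typical error of roughly 1.5-fold in the predicted RRF, and an MAE of 0.50 to roughly 3-fold. Because the concentration is obtained by dividing by the RRF, the same factor carries over to the concentration estimate.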