McEachran Andrew D, Mansouri Kamel, Newton Seth R, Beverly Brandiese E J, Sobus Jon R, Williams Antony J
Oak Ridge Institute for Science and Education (ORISE) Research Participation Program, US Environmental Protection Agency, 109 T.W. Alexander Drive, Research Triangle Park, NC 27711, USA; National Center for Computational Toxicology, Office of Research and Development, US Environmental Protection Agency, 109 T.W. Alexander Drive, Research Triangle Park, NC 27711, USA.
National Exposure Research Laboratory, Office of Research and Development, US Environmental Protection Agency, 109 T.W. Alexander Drive, Research Triangle Park, NC 27711, USA.
Talanta. 2018 May 15;182:371-379. doi: 10.1016/j.talanta.2018.01.022. Epub 2018 Jan 11.
High-resolution mass spectrometry (HRMS) data have revolutionized the identification of environmental contaminants through non-targeted analysis (NTA). However, chemical identification remains challenging due to the vast number of unknown molecular features typically observed in environmental samples. Advanced data processing techniques are required to improve chemical identification workflows. The ideal workflow brings together a variety of data and tools to increase the certainty of identification. One such tool is chromatographic retention time (RT) prediction, which can be used to reduce the number of possible suspect chemicals within an observed RT window. This paper compares the relative predictive ability and applicability to NTA workflows of three RT prediction models: (1) a logP (octanol-water partition coefficient)-based model using EPI Suite™ logP predictions; (2) a commercially available ACD/ChromGenius model; and (3) a newly developed Quantitative Structure Retention Relationship model called OPERA-RT. Models were developed using the same training set of 78 compounds with experimental RT data and evaluated for external predictivity on an identical test set of 19 compounds. Both the ACD/ChromGenius and OPERA-RT models outperformed the EPI Suite™ logP-based RT model (R² = 0.81-0.92, 0.86-0.83, 0.66-0.69 for training-test sets, respectively). Further, both OPERA-RT and ACD/ChromGenius predicted 95% of RTs within a ±15% chromatographic time window of experimental RTs. Based on these results, we simulated an NTA workflow with a ten-fold larger list of candidate structures generated for formulae of the known test set chemicals using the U.S. EPA's CompTox Chemistry Dashboard (https://comptox.epa.gov/dashboard). RTs for all candidates were predicted using both ACD/ChromGenius and OPERA-RT, and RT screening windows were assessed for their ability to filter out unlikely candidate chemicals and enhance potential identification. Compared to ACD/ChromGenius, OPERA-RT screened out a greater percentage of candidate structures within a 3-min RT window (60% vs. 40%) but retained fewer of the known chemicals (42% vs. 83%). By several metrics, the OPERA-RT model, generated as a proof-of-concept using a limited set of open source data, performed as well as the commercial tool ACD/ChromGenius when constrained to the same small training and test sets. As the availability of RT data increases, we expect the OPERA-RT model's predictive ability will increase.
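The two RT checks described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Candidate` class, the function names, and all numeric values are hypothetical, and the sketch assumes that the ±15% window is taken relative to the experimental RT and that the 3-min screening window is a total width centered on the observed RT. The abstract does not specify either detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate structure with a model-predicted RT (hypothetical type)."""
    name: str
    predicted_rt: float  # minutes


def fraction_within_window(pairs, tolerance=0.15):
    """Fraction of (predicted, experimental) RT pairs whose prediction falls
    within +/- tolerance * experimental RT (assumed window definition)."""
    hits = sum(abs(pred - exp) <= tolerance * exp for pred, exp in pairs)
    return hits / len(pairs)


def screen_candidates(candidates, observed_rt, window_min=3.0):
    """Keep candidates whose predicted RT lies inside a window of total width
    window_min centered on the observed RT (assumed window definition)."""
    half_width = window_min / 2
    return [c for c in candidates
            if abs(c.predicted_rt - observed_rt) <= half_width]


# Made-up values for illustration only; they are not data from the study.
pairs = [(10.0, 10.5), (4.0, 5.0), (8.0, 8.1)]
candidates = [Candidate("A", 5.0), Candidate("B", 6.2), Candidate("C", 9.5)]

kept = screen_candidates(candidates, observed_rt=6.0)
screened_out = 1 - len(kept) / len(candidates)
print(fraction_within_window(pairs), [c.name for c in kept], screened_out)
```

Under these assumptions, a narrower window screens out more candidates but raises the risk of discarding the true chemical, which is the trade-off the abstract reports between OPERA-RT and ACD/ChromGenius.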