Zhang Zixuan, Xu Xin, Xing Shipei, Shi Changzhi, You Zecang, Deng Xiaojun, Tan Ling, Mo Zhe, Fang Mingliang
Particle Pollution and Prevention (LAP3), Department of Environmental Science and Engineering, Fudan University, Shanghai 200438, China.
Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92093, United States.
Anal Chem. 2025 Jan 21;97(2):1170-1179. doi: 10.1021/acs.analchem.4c04249. Epub 2025 Jan 7.
Polycyclic aromatic hydrocarbons (PAHs) are pervasive environmental pollutants that pose significant health risks because of their carcinogenic, mutagenic, and teratogenic properties. Traditional methods for PAH identification rely primarily on gas chromatography-mass spectrometry (GC-MS) and combine spectral library searches with other techniques, such as mass defect analysis. However, these methods are limited by incomplete spectral libraries and a high false positive rate. Here, we present PAH-Finder, a data-driven workflow that integrates machine learning with high-resolution mass spectrometry (HRMS). PAH-Finder introduces a novel approach to evaluating the fragment distribution of PAH backbones in MS spectra by normalizing fragment m/z values to a 0-100% range relative to the molecular ion peak. Seven machine learning features capture PAH fragmentation characteristics, and a random forest model trained on 98 PAH spectra and 1003 background spectra achieved an F1 score of ∼0.9 in 5-fold cross-validation. Additionally, PAH-Finder uses the presence of doubly charged fragments and molecular formula prediction to improve identification accuracy. In a case study on particulate matter samples, PAH-Finder identified 135 PAHs, including 7 previously unreported PAH formulas, a 246% increase in annotation efficiency compared with the NIST20 library search. It also identified 32 heteroatom-doped PAHs that were not included in the training data set, indicating that the model generalizes beyond its training classes. PAH-Finder's high accuracy in detecting a broad spectrum of PAHs facilitates efficient data processing and interpretation for nontargeted analysis, supporting a better understanding of air pollution and the protection of public health. PAH-Finder is freely available on GitHub (https://github.com/FangLabNTU/PAH-Finder).
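The normalization step described in the abstract can be illustrated with a minimal Python sketch. This is not the PAH-Finder implementation: the function names, the four-bin intensity summary, and the toy spectrum are assumptions made for illustration only, and the abstract does not specify the seven features the model actually uses. The sketch only shows how fragment m/z values might be rescaled to 0-100% of the molecular ion m/z so that spectra of PAHs with different masses become comparable.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def normalize_fragments(peaks: List[Peak], molecular_ion_mz: float) -> List[Peak]:
    """Rescale each fragment m/z to a percentage of the molecular ion m/z.

    Peaks above the molecular ion (e.g. isotope or noise peaks) are dropped,
    which is a simplifying assumption of this sketch.
    """
    if molecular_ion_mz <= 0:
        raise ValueError("molecular_ion_mz must be positive")
    return [
        (100.0 * mz / molecular_ion_mz, intensity)
        for mz, intensity in peaks
        if 0 < mz <= molecular_ion_mz
    ]


def intensity_fraction_by_bin(normalized: List[Peak], n_bins: int = 4) -> List[float]:
    """Hypothetical feature: share of total intensity in equal-width bins
    of the 0-100% normalized axis."""
    totals = [0.0] * n_bins
    width = 100.0 / n_bins
    for pct, intensity in normalized:
        # A peak at exactly 100% (the molecular ion) goes into the last bin.
        index = min(int(pct // width), n_bins - 1)
        totals[index] += intensity
    grand_total = sum(totals)
    if grand_total == 0:
        return totals
    return [t / grand_total for t in totals]


if __name__ == "__main__":
    # Toy spectrum with made-up values, not measured data.
    toy_spectrum = [(50.0, 10.0), (100.0, 30.0), (200.0, 60.0)]
    normalized = normalize_fragments(toy_spectrum, molecular_ion_mz=200.0)
    print(normalized)
    print(intensity_fraction_by_bin(normalized))
```

Feature vectors of this kind could then be passed to any standard classifier. The abstract reports a random forest evaluated with 5-fold cross-validation, but the exact features and hyperparameters should be taken from the PAH-Finder repository rather than from this sketch.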