Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States.
Front Public Health. 2021 Jun 10;9:653599. doi: 10.3389/fpubh.2021.653599. eCollection 2021.
Untargeted chemical analysis of biofluids provides semi-quantitative data for thousands of chemicals, expanding our understanding of the relationships among metabolic pathways, diseases, phenotypes and exposures. During the processing of mass spectral and chromatography data, various signal thresholds are used to control the number of peaks in the final data matrix that is used for statistical analyses. However, commonly used stringent thresholds generate constrained data matrices that may under-represent the detected chemical space, leading to missed biological insights in exposome research. We re-analyzed a liquid chromatography high-resolution mass spectrometry data set from a publicly available epidemiology study (n = 499) of human cord blood samples using the MS-DIAL software, with the lowest possible thresholds during the data processing steps. Peak lists for individual files and the data matrix after the alignment and gap-filling steps were summarized for different peak height and detection frequency thresholds. Correlations between birth weight and LC/MS peaks in the newly generated data matrix were computed using the Spearman correlation coefficient. The MS-DIAL software detected on average 23,156 peaks per individual LC/MS file and 63,393 peaks in the aligned peak table. The combination of peak height and detection frequency thresholds used in the original publication, at the individual file and peak alignment levels, can reject 90% of the peaks in the untargeted chemical analysis data set generated by MS-DIAL. Correlation analysis for the birth weight data suggested that up to 80% of the significantly associated peaks were rejected by the data processing thresholds used in the original publication. The re-analysis with the lowest possible thresholds recovered metabolic insights about C19 steroids and hydroxy-acyl-carnitines and their relationships with birth weight.
Data processing thresholds for peak height and detection frequency, at the individual data file level and at the alignment level, should be set as low as possible or avoided completely when mining untargeted chemical analysis data in exposome research to discover new biomarkers and mechanisms.
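The effect of combining a peak height threshold with a detection frequency threshold can be illustrated with a minimal Python sketch. The function name `filter_peaks`, the toy matrix, and the threshold values are all assumptions made for illustration; they are not the settings of the original publication and do not reproduce MS-DIAL's internal processing.

```python
import numpy as np

def filter_peaks(heights, min_height, min_detection_freq):
    """Return a boolean mask of peaks (rows) that pass both thresholds.

    heights: 2-D array, rows = aligned peaks, columns = samples.
    A peak counts as detected in a sample when its height >= min_height.
    """
    heights = np.asarray(heights, dtype=float)
    detected = heights >= min_height          # per-sample detection calls
    detection_freq = detected.mean(axis=1)    # fraction of samples with the peak
    return detection_freq >= min_detection_freq

# Hypothetical toy matrix: 4 peaks x 5 samples (illustrative values only).
toy = np.array([
    [9000, 8500, 9200, 8800, 9100],
    [ 400,  350, 5000,  300,  450],
    [1200, 1500,  900, 1100, 1300],
    [   0,    0,  200,    0,    0],
])

# Stringent (assumed) thresholds versus effectively no thresholds.
strict = filter_peaks(toy, min_height=1000, min_detection_freq=0.5)
lenient = filter_peaks(toy, min_height=1, min_detection_freq=0.0)

# Fraction of peaks that the stringent settings discard.
rejected_fraction = 1 - strict.sum() / lenient.sum()
```

In this toy case the stringent settings drop the two low-abundance or rarely detected peaks, which is the kind of loss the abstract quantifies at a much larger scale.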