Pasterski Michael J, Lorenz Matthias, Ievlev Anton V, Wickramasinghe Raveendra C, Hanley Luke, Kenig Fabien
Department of Earth and Environmental Sciences, University of Illinois Chicago, Chicago, Illinois 60607, United States.
Center for Nanophase Materials Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States.
J Am Soc Mass Spectrom. 2025 Jan 1;36(1):58-71. doi: 10.1021/jasms.4c00300. Epub 2024 Dec 19.
The spatial distribution of organics in geological samples can be used to determine when and how these organics were incorporated into the host rock. Mass spectrometry (MS) imaging can rapidly collect a large amount of data, but the ions produced are mixed without discrimination, resulting in complex mass spectra that can be difficult to interpret. Here, we apply unsupervised and supervised machine learning (ML) to help interpret spectra from time-of-flight secondary ion mass spectrometry (ToF-SIMS) of an organic-carbon-rich mudstone from the Middle Jurassic of England (UK). It was previously shown that the presence of sterane molecular biomarkers in this sample can be detected via ToF-SIMS (Pasterski, M. J. et al., 2023, 23, 936). We use unsupervised ML on scanning electron microscopy-energy-dispersive spectroscopy (SEM-EDS) measurements to define compositional categories based on differences in elemental abundances. We then test the ability of four ML algorithms, namely k-nearest neighbors (KNN), recursive partitioning and regression trees (RPART), eXtreme Gradient Boosting (XGBoost), and random forest (RF), to classify the ToF-SIMS spectra using (1) the categories assigned via SEM-EDS, (2) organic and inorganic labels assigned via SEM-EDS, and (3) the presence or absence of detectable steranes in ToF-SIMS spectra. In terms of predictive accuracy and balanced accuracy, KNN was the best-performing model and RPART the worst. Feature importance, i.e., the specific features of the ToF-SIMS spectra that a model uses to make classifications, cannot be determined for KNN, preventing post hoc model interpretation. Nevertheless, the feature importance extracted from the other models was useful for interpreting spectra. We determined that some of the organic ions used to classify biomarker-containing spectra may be fragment ions derived from kerogen, which is abundant in this mudstone sample.
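The abstract ranks the models by both predictive accuracy and balanced accuracy. The sketch below illustrates why the two can differ, using the standard definitions (balanced accuracy as the mean of per-class recall). The abstract does not state how the authors computed these metrics, so the definitions are an assumption here, and the labels and helper function names are hypothetical, made up for illustration only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of spectra whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    totals = Counter(y_true)  # number of true examples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical labels (not data from the paper):
# 8 spectra without detectable steranes, 2 with.
y_true = ["absent"] * 8 + ["present"] * 2
# A trivial classifier that always predicts the majority class.
y_pred = ["absent"] * 10

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
# prints "accuracy=0.80, balanced accuracy=0.50"
```

With imbalanced classes, a model that never detects the minority class can still score a high plain accuracy, which is presumably why reporting both metrics is useful when one label (such as sterane-bearing spectra) is rare.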