Le Katelyn, Radović Jagoš R, MacCallum Justin L, Larter Stephen R, Van Humbeck Jeffrey F
Department of Chemistry, University of Calgary, Calgary, Alberta T2N 1N4, Canada.
Center for Petroleum Geochemistry (UH-CPG), Department of Earth and Atmospheric Sciences, University of Houston, Houston, Texas 77204-5007, United States.
J Am Chem Soc. 2024 Aug 14;146(32):22563-22569. doi: 10.1021/jacs.4c06595. Epub 2024 Jul 31.
The ability to quantify individual components of complex mixtures is a challenge found throughout the life and physical sciences. An improved capacity to generate large data sets along with the uptake of machine-learning (ML)-based analysis tools has allowed for various "omics" disciplines to realize exceptional advances. Other areas of chemistry that deal with complex mixtures often do not leverage these advances. Environmental samples, for example, can be more difficult to access, and the resulting small data sets are less appropriate for unconstrained ML approaches. Herein, we present an approach to address this latter issue. Using a very small environmental data set (35 high-resolution mass spectra gathered from various solvent extractions of Canadian petroleum fractions), we show that the application of specific domain knowledge can lead to ML models with notable performance.