Sun Rongjie, Gardner Wil, Winkler David A, Muir Benjamin W, Pigram Paul J
Centre for Materials and Surface Science and Department of Mathematical and Physical Sciences, La Trobe University, Bundoora, Victoria 3086, Australia.
La Trobe Institute for Molecular Science, La Trobe University, Bundoora, Victoria 3086, Australia.
Anal Chem. 2024 May 14;96(19):7594-7601. doi: 10.1021/acs.analchem.4c00456. Epub 2024 Apr 30.
Multivariate statistical tools and machine learning (ML) techniques can deconvolute hyperspectral data and control the disparity between the number of samples and the number of features in materials science. Nevertheless, the importance of generating sufficient high-quality sample replicates in training data cannot be overlooked, as it fundamentally affects the performance of ML models. Here, we present a quantitative analysis of time-of-flight secondary ion mass spectrometry (ToF-SIMS) spectra of a simple microarray system of two food dyes using partial least-squares (PLS, linear) and random forest (RF, nonlinear) algorithms. This microarray was generated by a high-throughput sample preparation and analysis workflow for fast and efficient acquisition of high-quality, reproducible spectra via ToF-SIMS. We drew insights from the bias-variance trade-off, investigated the performance of PLS and RF regression models as a function of training data size, and inferred the amount of data needed to construct accurate and reliable regression models. In addition, we found that concatenating the positive and negative ToF-SIMS spectra improved model performance. This study provides an empirical basis for the future design of high-throughput microarrays and multicomponent systems for analysis with ToF-SIMS and ML.