Enhancing partial least squares modeling of comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry data by tile-based variance ranking.

Author information

Department of Chemistry, University of Washington, Box 351700, Seattle, WA, 98195, USA.

Publication information

J Chromatogr A. 2023 Apr 12;1694:463920. doi: 10.1016/j.chroma.2023.463920. Epub 2023 Mar 11.

Abstract

Chemometric methods such as partial least squares (PLS) regression are valuable for correlating sample-based differences hidden in comprehensive two-dimensional gas chromatography (GC × GC) data with independently measured physicochemical properties. This work establishes the first implementation of tile-based variance ranking as a selective data reduction methodology to improve PLS modeling performance for 58 diverse aerospace fuels. Tile-based variance ranking discovered a total of 521 analytes with a squared relative standard deviation (RSD) in signal between 0.07 and 22.84. The goodness-of-fit of the models was determined by their normalized root-mean-square error of cross-validation (NRMSECV) and normalized root-mean-square error of prediction (NRMSEP). PLS models developed for viscosity, hydrogen content, and heat of combustion using all 521 features discovered by tile-based variance ranking had respective NRMSECV (NRMSEP) values of 10.5 % (10.2 %), 8.3 % (7.6 %), and 13.1 % (13.5 %). In contrast, a single-grid binning scheme, a common data reduction strategy for PLS analysis, resulted in less accurate models for viscosity (NRMSECV = 14.2 %; NRMSEP = 14.3 %), hydrogen content (NRMSECV = 12.1 %; NRMSEP = 11.0 %), and heat of combustion (NRMSECV = 14.4 %; NRMSEP = 13.6 %). Further, the features discovered by tile-based variance ranking can be optimized for each PLS model with RReliefF analysis, a machine learning algorithm. RReliefF feature optimization selected 48, 125, and 172 analytes out of the original 521 discovered by tile-based variance ranking to model viscosity, hydrogen content, and heat of combustion, respectively. The RReliefF-optimized features yielded highly accurate property-composition models for viscosity (NRMSECV = 7.9 %; NRMSEP = 5.8 %), hydrogen content (NRMSECV = 7.0 %; NRMSEP = 4.9 %), and heat of combustion (NRMSECV = 7.9 %; NRMSEP = 8.4 %).
This work also demonstrates that processing the chromatograms with a tile-based approach allows the analyst to directly identify the analytes of importance in a PLS model. Coupling tile-based feature selection with PLS analysis allows for deeper understanding in any property-composition study.
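
The two quantities the abstract reports, the squared RSD used to rank features and the normalized RMSE used to judge the models, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tile-based processing of the chromatograms is not reproduced, the function names and toy numbers are hypothetical, and the abstract does not state how the RMSE is normalized, so normalization by the range of the reference values is an assumption here.

```python
import math
from statistics import mean, variance


def squared_rsd(signals):
    """Squared relative standard deviation: sample variance / mean squared."""
    m = mean(signals)
    return variance(signals) / (m * m)


def rank_by_variance(features):
    """Return (name, squared RSD) pairs sorted from most to least variable.

    `features` maps a feature name to its signal across all samples.
    """
    scores = {name: squared_rsd(vals) for name, vals in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def nrmse_percent(reference, predicted):
    """RMSE expressed as a percentage of the reference range.

    Range normalization is an assumption; the abstract does not specify it.
    """
    n = len(reference)
    rmse = math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)
    return 100.0 * rmse / (max(reference) - min(reference))


# Toy values, invented purely for illustration.
toy_features = {
    "flat": [5.0, 5.0, 5.0],      # no variation across samples
    "varied": [1.0, 2.0, 3.0],    # strong variation across samples
}
ranking = rank_by_variance(toy_features)
error = nrmse_percent([0.0, 10.0], [1.0, 9.0])
```

Under these assumptions, features that vary strongly across samples rise to the top of the ranking. The same `nrmse_percent` function would apply to cross-validation predictions (NRMSECV) or to held-out predictions (NRMSEP); only the source of `predicted` differs.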
