Université de Strasbourg (Unistra), Institut National de la Santé et de la Recherche Médicale, IRFAC Inserm U1113, 3 Avenue Molière, 67200 Strasbourg, France.
Université de Reims Champagne-Ardenne, BioSpecT EA 7506, 51 Rue Cognacq-Jay, 51097 Reims, France.
Anal Chem. 2022 Nov 22;94(46):16050-16059. doi: 10.1021/acs.analchem.2c03118. Epub 2022 Nov 8.
Dimensionality reduction of high-dimensional datasets, such as those acquired by Fourier transform infrared (FTIR) spectroscopy, is a critical step in the data analysis workflow. To this end, numerous feature selection methods have been developed and applied in a supervised context, i.e., using a priori knowledge about the data, usually in the form of labels for classification or quantitative values for regression. Genetic algorithms have been widely used for this purpose because of their flexibility and their global optimization principle. However, few applications in an unsupervised context have been reported in infrared spectroscopy. The aim of this article is to propose a new unsupervised feature selection method based on a genetic algorithm that uses a validity index computed from KMeans partitions as its fitness function. Evaluated on a simulated dataset, then validated and tested on three real-world infrared spectroscopic datasets, the proposed algorithm finds spectral descriptors that improve clustering accuracy and simplify the spectral interpretation of the results.
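The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which validity index, genetic operators, or parameter values were used, so the mean silhouette width, tournament selection, uniform crossover, bit-flip mutation, and all function names and defaults below (`kmeans`, `silhouette`, `fitness`, `ga_select`, `pop_size`, `n_gen`, `p_mut`) are assumptions chosen for illustration. Each individual is a Boolean mask over the features (wavenumbers, in the FTIR case), and its fitness is the validity index of a k-means partition computed on the selected features only.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm (illustrative); returns cluster labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def silhouette(X, labels):
    """Mean silhouette width, used here as the (assumed) validity index."""
    ids = np.unique(labels)
    if len(ids) < 2:
        return -1.0
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: silhouette is defined as 0
        a = D[i, own].sum() / (n_own - 1)  # mean intra-cluster distance
        b = min(D[i, labels == c].mean() for c in ids if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

def fitness(mask, X, k, seed):
    """Validity index of a k-means partition on the selected features."""
    if mask.sum() == 0:
        return -1.0  # an empty feature subset is the worst case
    Xs = X[:, mask]
    # Fixed seed so the same mask always gets the same fitness.
    return silhouette(Xs, kmeans(Xs, k, np.random.default_rng(seed)))

def ga_select(X, k, pop_size=20, n_gen=15, p_mut=0.05, seed=0):
    """Genetic search over Boolean feature masks (hypothetical defaults)."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.5
    best_mask, best_fit = None, -np.inf
    for _ in range(n_gen):
        fits = np.array([fitness(m, X, k, seed) for m in pop])
        i_best = fits.argmax()
        if fits[i_best] > best_fit:
            best_fit, best_mask = fits[i_best], pop[i_best].copy()
        # Binary tournament selection.
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((fits[a] >= fits[b])[:, None], pop[a], pop[b])
        # Uniform crossover with randomly shuffled mates.
        mates = parents[rng.permutation(pop_size)]
        children = np.where(rng.random((pop_size, n_feat)) < 0.5, parents, mates)
        # Bit-flip mutation, then elitism (keep the best mask seen so far).
        children ^= rng.random((pop_size, n_feat)) < p_mut
        children[0] = best_mask
        pop = children
    return best_mask, float(best_fit)

# Toy data: 3 clusters visible in the first 2 features, plus 8 noise features.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
informative = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
X = np.hstack([informative, 3.0 * rng.normal(size=(90, 8))])

mask, score = ga_select(X, k=3)
print("selected features:", np.flatnonzero(mask), "silhouette:", round(score, 3))
```

One caveat of this design: an internal index such as the silhouette is computed in the subspace being evaluated, so it can favour small feature subsets. A practical variant might penalise or constrain the subset size, or average the fitness over several k-means initialisations.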