Wang Sehee, Kim So Yeon, Sohn Kyung-Ah
Department of Artificial Intelligence, Ajou University, Suwon 16499, Republic of Korea.
Department of Software and Computer Engineering, Ajou University, Suwon 16499, Republic of Korea.
Bioengineering (Basel). 2023 Jul 10;10(7):824. doi: 10.3390/bioengineering10070824.
Feature selection methods are essential for accurate disease classification and identifying informative biomarkers. While information-theoretic methods have been widely used, they often exhibit limitations such as high computational costs. Our previously proposed method, ClearF, addresses these issues by using reconstruction error from low-dimensional embeddings as a proxy for the entropy term in the mutual information. However, ClearF still has limitations, including a nontransparent bottleneck layer selection process, which can result in unstable feature selection. To address these limitations, we propose ClearF++, which simplifies the bottleneck layer selection and incorporates feature-wise clustering to enhance biomarker detection. We compare its performance with other commonly used methods such as MultiSURF and IFS, as well as ClearF, across multiple benchmark datasets. Our results demonstrate that ClearF++ consistently outperforms these methods in terms of prediction accuracy and stability, even with limited samples. We also observe that employing the Deep Embedded Clustering (DEC) algorithm for feature-wise clustering improves performance, indicating its suitability for handling complex data structures with limited samples. ClearF++ offers an improved biomarker prioritization approach with enhanced prediction performance and faster execution. Its stability and effectiveness with limited samples make it particularly valuable for biomedical data analysis.