Shin Hyunseok, Oh Sejong
Department of Computer Science, Dankook University, Youngin, Gyeonggi, South Korea.
Department of Software Science, Dankook University, Youngin, Gyeonggi, South Korea.
BMC Bioinformatics. 2024 Dec 26;25(1):390. doi: 10.1186/s12859-024-06017-9.
High-dimensional datasets with low sample sizes (HDLSS) are pivotal in the fields of biology and bioinformatics. One of the core objectives in HDLSS analysis is to select the most informative features and discard redundant or irrelevant ones. This is particularly crucial in bioinformatics, where accurate feature (gene) selection can lead to breakthroughs in drug development and provide insights into disease diagnostics. Despite its importance, identifying optimal features remains a significant challenge in HDLSS.
To address this challenge, we propose an effective feature selection method that combines gradual permutation filtering with a heuristic tribrid search strategy, specifically tailored for HDLSS contexts. The proposed method considers inter-feature interactions and leverages feature rankings during the search process. In addition, we suggest a new performance metric for HDLSS that evaluates both the number and the quality of selected features. In a comparison with existing methods on benchmark data, the proposed method reduced the average number of selected features from 37.8 to 5.5 and improved the performance of the prediction model built on the selected features from 0.855 to 0.927.
The proposed method effectively selects a small number of important features and achieves high prediction performance.
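The general idea of permutation-based filtering can be illustrated with a generic label-permutation filter. This is a minimal sketch under stated assumptions, not the authors' gradual permutation filtering or tribrid search, whose details are not given in this abstract. The function names `class_mean_gap` and `permutation_filter`, the class-mean-gap score, the binary labels, and the values of `n_perm` and `alpha` are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def class_mean_gap(column, labels):
    """Absolute difference between the two class means of one feature.

    Assumed univariate score; the paper's actual scoring is not specified here.
    """
    pos = [v for v, c in zip(column, labels) if c == 1]
    neg = [v for v, c in zip(column, labels) if c == 0]
    return abs(mean(pos) - mean(neg))


def permutation_filter(X, y, n_perm=200, alpha=0.05, seed=0):
    """Return indices of features whose score beats label-shuffled scores.

    X: list of samples (each a list of feature values); y: list of 0/1 labels.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_features)]
    observed = [class_mean_gap(col, y) for col in columns]

    # Count how often a shuffled labeling scores at least as well as the real one.
    exceed = [0] * n_features
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        for j, col in enumerate(columns):
            if class_mean_gap(col, shuffled) >= observed[j]:
                exceed[j] += 1

    # Empirical p-value with the usual +1 correction.
    p_values = [(e + 1) / (n_perm + 1) for e in exceed]
    return [j for j, p in enumerate(p_values) if p <= alpha]


# Toy HDLSS-style data: 20 samples, 50 features, only feature 0 carries signal.
rng = random.Random(1)
y = [i % 2 for i in range(20)]
X = [[5.0 * label + 0.1 * (i % 3)] + [rng.gauss(0, 1) for _ in range(49)]
     for i, label in enumerate(y)]
selected = permutation_filter(X, y)
print(selected)
```

In this toy setting the informative feature (index 0) should survive the filter, while each noise feature passes only occasionally, at roughly the rate set by `alpha`. A univariate filter like this ignores inter-feature interactions, which the proposed method is stated to take into account during its search.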