Le Phi, Gong Xingyue, Ung Leah, Yang Hai, Keenan Bridget P, Zhang Li, He Tao
Division of Hematology/Oncology, Department of Medicine, University of California, San Francisco, San Francisco, CA, United States.
Department of Physiological Nursing, School of Nursing, University of California, San Francisco, San Francisco, CA, United States.
Front Syst Biol. 2024;4. doi: 10.3389/fsysb.2024.1355595. Epub 2024 Mar 20.
Identifying features associated with a clinical outcome of interest is a rapidly advancing area of research. However, because contemporary sequencing technologies can measure thousands of genes per sample, it is challenging to construct efficient prediction models that balance accuracy and resource utilization. To address this challenge, researchers have developed feature selection methods that enhance performance, reduce overfitting, and improve resource efficiency. Applying feature selection to survival analysis introduces unique challenges, particularly in clinical datasets characterized by substantial censoring and limited sample sizes. We propose a robust ensemble feature selection approach integrated with group Lasso to identify compelling features, and we evaluate its performance in predicting survival outcomes. In extensive simulations, our approach consistently outperforms established models across various criteria, demonstrating low false discovery rates, high sensitivity, and high stability. Furthermore, we applied the approach to a colorectal cancer dataset from The Cancer Genome Atlas, where a composite score based on the selected genes correctly distinguished patient subtypes. In summary, our proposed approach excels at selecting impactful features from high-dimensional data, yielding better outcomes than contemporary state-of-the-art models.
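The three simulation criteria named in the abstract (false discovery rate, sensitivity, and stability) can be illustrated with a minimal sketch. The abstract does not define these quantities, so the definitions below are assumptions: FDR as the fraction of selected features that are not truly associated with the outcome, sensitivity as the fraction of true features recovered, and stability as the mean pairwise Jaccard similarity between feature sets selected on repeated runs. The function names and the toy gene labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def false_discovery_rate(selected, truth):
    """Fraction of selected features that are not true features (assumed definition)."""
    selected, truth = set(selected), set(truth)
    if not selected:
        return 0.0  # convention: nothing selected means no false discoveries
    return len(selected - truth) / len(selected)


def sensitivity(selected, truth):
    """Fraction of true features that were selected (assumed definition)."""
    selected, truth = set(selected), set(truth)
    if not truth:
        raise ValueError("truth must contain at least one feature")
    return len(selected & truth) / len(truth)


def jaccard_stability(selections):
    """Mean pairwise Jaccard similarity across runs (one common stability
    measure; the paper may use a different one)."""
    sets = [set(s) for s in selections]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two selections to compare")

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical simulation: four truly associated genes and three selection runs.
truth = {"g1", "g2", "g3", "g4"}
runs = [
    {"g1", "g2", "g3", "g9"},
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4", "g9"},
]

fdr_first_run = false_discovery_rate(runs[0], truth)
sens_first_run = sensitivity(runs[0], truth)
stability = jaccard_stability(runs)
```

In a simulation study, the true feature set is known by construction, so FDR and sensitivity can be computed for every fitted model, while stability only requires the selections themselves and can therefore also be reported on real data.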