Wu Xinchao, Wang Jieqiong, Wan Shibiao
Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE.
Department of Neurological Sciences, University of Nebraska Medical Center, Omaha, NE.
bioRxiv. 2025 May 7:2025.05.01.651699. doi: 10.1101/2025.05.01.651699.
Lung cancer is the leading cause of cancer death, and non-small cell lung cancer (NSCLC) is its most common type, accounting for most cases. Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are two NSCLC subtypes that are difficult to distinguish accurately with conventional methods. Existing approaches rely on histological examination and imaging, which can lack definitive histologic features and are time-consuming.
To address these concerns, we propose RPSLearner, which combines Random Projection (RP) for dimensionality reduction with stacking ensemble learning to accurately predict lung cancer subtypes. Specifically, multiple independent RP matrices are first generated to project the high-dimensional RNA-seq data into a lower-dimensional space, and the resulting features are concatenated. The fused features are then fed into a stack of diverse base classifiers, and the base-model predictions are integrated by a deep linear-layer network.
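The projection-and-fusion step described above can be illustrated with a minimal sketch. It assumes Gaussian projection matrices and uses synthetic data; the function names, matrix sizes, and seeds are hypothetical and are not taken from RPSLearner. The stacking stage (base classifiers plus meta-model) is not shown.

```python
import math
import random


def gaussian_rp_matrix(n_features, n_components, seed):
    """Return an n_features x n_components matrix with N(0, 1/n_components) entries."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)  # standard deviation of each entry
    return [[rng.gauss(0.0, scale) for _ in range(n_components)]
            for _ in range(n_features)]


def project(samples, matrix):
    """Multiply each sample (a list of floats) by the projection matrix."""
    n_components = len(matrix[0])
    return [
        [sum(x * row[j] for x, row in zip(sample, matrix))
         for j in range(n_components)]
        for sample in samples
    ]


def fuse_random_projections(samples, n_components, n_projections, base_seed=0):
    """Concatenate the outputs of several independent random projections."""
    fused = [[] for _ in samples]
    for k in range(n_projections):
        # A different seed per projection gives independent matrices (assumption).
        matrix = gaussian_rp_matrix(len(samples[0]), n_components, base_seed + k)
        for fused_row, projected_row in zip(fused, project(samples, matrix)):
            fused_row.extend(projected_row)
    return fused


# Toy stand-in for an expression matrix: 6 samples x 200 "genes" (synthetic values).
data_rng = random.Random(42)
expression = [[data_rng.random() for _ in range(200)] for _ in range(6)]
fused_features = fuse_random_projections(expression, n_components=10, n_projections=3)
print(len(fused_features), len(fused_features[0]))  # 6 30
```

In this sketch each sample ends up with `n_components * n_projections` fused features, which would then serve as the input to the base classifiers.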
Benchmarking tests on 1,333 NSCLC patients demonstrated that RPSLearner outperformed state-of-the-art approaches for lung cancer subtype classification. Specifically, RPSLearner effectively preserved sample-to-sample distances even after substantial dimensionality reduction, and its meta-model yielded consistently higher accuracy, F1, and AUC scores than the individual base models and state-of-the-art approaches for lung cancer subtyping. In addition, the feature fusion method used in RPSLearner showed better performance than conventional score-ensemble methods.
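One way to check the distance-preservation claim is to compare pairwise Euclidean distances before and after projection. The sketch below is a generic check under assumed settings (Gaussian projection, synthetic data, hypothetical function names and sizes), not the evaluation protocol used in the study.

```python
import math
import random


def random_projection(samples, n_components, seed=0):
    """Project samples with a Gaussian matrix whose entries are N(0, 1/n_components)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    n_features = len(samples[0])
    matrix = [[rng.gauss(0.0, scale) for _ in range(n_components)]
              for _ in range(n_features)]
    return [
        [sum(x * row[j] for x, row in zip(sample, matrix))
         for j in range(n_components)]
        for sample in samples
    ]


def pairwise_distance_ratios(original, projected):
    """Return projected/original Euclidean distance for every pair of samples."""
    ratios = []
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            ratios.append(
                math.dist(projected[i], projected[j])
                / math.dist(original[i], original[j])
            )
    return ratios


# Synthetic data: 8 samples x 500 features, reduced to 200 dimensions.
data_rng = random.Random(1)
samples = [[data_rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(8)]
projected = random_projection(samples, n_components=200, seed=7)
ratios = pairwise_distance_ratios(samples, projected)
mean_ratio = sum(ratios) / len(ratios)
print(f"pairs={len(ratios)} min={min(ratios):.3f} max={max(ratios):.3f} mean={mean_ratio:.3f}")
```

Ratios close to 1 indicate that the projection approximately preserves sample-to-sample distances; the spread is expected to shrink as the number of projected dimensions grows.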
We developed a novel stacking learning method, RPSLearner, which combines RP and stacking learning to enable efficient and accurate identification of NSCLC subtypes. RPSLearner is a promising lung cancer subtyping model for downstream clinical diagnosis and personalized treatment, and the framework has the potential to be extended to subtyping of other cancer types.