Subclassification of lung adenocarcinoma through comprehensive multi-omics data to benefit survival outcomes.

Author information

Wei Jiayi, Wang Xin, Guo Hongping, Zhang Ling, Shi Yao, Wang Xiao

Affiliations

Qingdao University, Qingdao, China.

Hubei Normal University, China.

Publication information

Comput Biol Chem. 2024 Oct;112:108150. doi: 10.1016/j.compbiolchem.2024.108150. Epub 2024 Jul 14.

DOI:10.1016/j.compbiolchem.2024.108150

PMID:39018587

Abstract

OBJECTIVES

Lung adenocarcinoma (LUAD) is the most common subtype of non-small cell lung cancer, and understanding the molecular mechanisms underlying its progression is of great clinical significance. This study aims to identify novel molecular markers associated with LUAD subtypes in order to improve the precision of LUAD subtype classification. It also seeks to optimize the classification from the perspective of patient survival analysis.

MATERIALS AND METHODS

We propose a comprehensive and robust feature-selection approach for LUAD classification. The method integrates multi-omics data from The Cancer Genome Atlas (TCGA) and combines three algorithms: max-relevance and min-redundancy (mRMR), the least absolute shrinkage and selection operator (LASSO), and Boruta. The selected features were then used in six machine-learning classifiers: logistic regression, random forest, support vector machine, naive Bayes, k-nearest neighbor, and XGBoost.
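As an illustration of the max-relevance and min-redundancy idea only, the following minimal sketch greedily selects features that correlate with the class label while penalizing overlap with features already chosen. It is not the authors' pipeline: the use of absolute Pearson correlation for both relevance and redundancy (mutual information is the more common choice), the helper names `pearson` and `mrmr_select`, and the toy feature names and values are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_select(columns, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy.

    Relevance is |corr(feature, label)|; redundancy is the mean
    |corr(feature, already selected feature)|. Ties go to the
    alphabetically first name.
    """
    relevance = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    selected = []
    remaining = set(columns)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs(pearson(columns[name], columns[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 8 samples, binary subtype label.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "gene_a": [0, 0, 0, 1, 1, 1, 1, 0],      # informative
    "gene_a_dup": [0, 0, 0, 2, 2, 2, 2, 0],  # rescaled copy of gene_a (redundant)
    "gene_b": [0, 0, 1, 0, 1, 1, 0, 1],      # informative, uncorrelated with gene_a
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],       # uncorrelated with the label
}

selected = mrmr_select(features, labels, k=2)
print(selected)  # ['gene_a', 'gene_b']
```

In this toy case the rescaled copy is skipped in favor of the second, non-overlapping informative feature, which is the behavior the redundancy penalty is meant to produce. A real analysis would typically pass the mRMR shortlist on to further filters such as LASSO and Boruta, as the abstract describes.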

RESULTS

The proposed approach achieved an area under the receiver operating characteristic curve (AUC) of 0.9958 with logistic regression (LR). Notably, a composite model incorporating copy number, methylation, and RNA-sequencing data for the expression of exons, genes, and miRNA mature strands surpassed models built on single-omics data or other multi-omics combinations in both accuracy and AUC. Survival analyses showed that the support vector machine (SVM) classifier yielded the best classification, outperforming the classification achieved by TCGA. To enhance model interpretability, SHapley Additive exPlanations (SHAP) values were used to elucidate the impact of each feature on the predictions. Gene Ontology (GO) enrichment analysis identified significant biological processes, molecular functions, and cellular components associated with LUAD subtypes.
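For readers less familiar with the AUC metric reported above, the sketch below computes it for a binary problem as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the example labels and scores are assumptions for illustration; the study's own evaluation code and its handling of multiple subtypes are not described in the abstract.

```python
def roc_auc(labels, scores):
    """Binary AUC via pairwise comparison of positive and negative scores.

    Equivalent to the area under the ROC curve; ties contribute 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for four samples.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Here three of the four positive-negative pairs are ranked correctly, giving 0.75. An AUC close to 1, such as the reported 0.9958, means almost every such pair is ordered correctly.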

CONCLUSION

In summary, our feature-selection process, based on TCGA multi-omics data and combined with multiple machine-learning classifiers, effectively identifies molecular subtypes of lung adenocarcinoma and their corresponding significant genes. Our method could enhance the early detection and diagnosis of LUAD, expedite the development of targeted therapies and, ultimately, prolong patient survival.
