
A feature selection-based framework to identify biomarkers for cancer diagnosis: A focus on lung adenocarcinoma.

Affiliations

University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt.

Center of Informatics Science, Nile University, Giza, Egypt.

Publication information

PLoS One. 2022 Sep 6;17(9):e0269126. doi: 10.1371/journal.pone.0269126. eCollection 2022.

Abstract

Lung cancer (LC) is among the most frequently diagnosed cancers worldwide. Of its many types, lung adenocarcinoma (LUAD) is the most common. Although RNA-seq and microarray experiments yield vast amounts of gene expression data, most genes are irrelevant to clinical diagnosis. Feature selection (FS) techniques address the high dimensionality and sparsity of such large-scale data. We propose a framework that applies an ensemble of FS techniques to identify genes highly correlated with LUAD. Using LUAD RNA-seq data from The Cancer Genome Atlas (TCGA), we applied mutual information (MI) and recursive feature elimination (RFE) as FS techniques, together with a support vector machine (SVM) classification model. We also used Random Forest (RF) as an embedded FS technique. The results were integrated, and candidate biomarker genes across all techniques were identified. The framework identified 12 potential biomarkers that are highly correlated with different LC types, especially LUAD. A predictive model trained on the expression profiles of the identified biomarkers achieved a performance of 97.99%. In addition, differential gene expression analysis showed that all 12 genes were significantly differentially expressed between normal and LUAD tissues, and previous reports indicate that they are strongly correlated with LUAD. We propose that using multiple feature selection methods effectively reduces the number of identified biomarkers and directly affects their biological relevance.
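The abstract does not give implementation details, so the following is only a minimal sketch of how a filter method (MI), a wrapper method (RFE around a linear SVM), and an embedded method (RF importances) might be combined. It assumes scikit-learn, uses a synthetic matrix in place of TCGA expression data, and integrates the three results by simple intersection. The cut-off `k`, the intersection rule, and the helper names `top_k` and `consensus_features` are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def top_k(scores, k):
    """Return the indices of the k largest scores as a set (hypothetical helper)."""
    return set(np.argsort(scores)[-k:].tolist())


def consensus_features(X, y, k=20, seed=0):
    """Intersect the top-k features picked by MI, SVM-based RFE, and RF.

    The intersection rule is an assumption; the source only says the
    results of the techniques were integrated.
    """
    # Filter method: mutual information between each feature and the label.
    mi_scores = mutual_info_classif(X, y, random_state=seed)
    mi_set = top_k(mi_scores, k)

    # Wrapper method: recursive feature elimination around a linear SVM,
    # dropping 10% of the original feature count per iteration.
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=k, step=0.1)
    rfe.fit(X, y)
    rfe_set = set(np.flatnonzero(rfe.support_).tolist())

    # Embedded method: impurity-based importances from a random forest.
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    rf_set = top_k(forest.feature_importances_, k)

    return sorted(mi_set & rfe_set & rf_set)


# Synthetic stand-in for a samples-by-genes expression matrix (not TCGA data).
X, y = make_classification(
    n_samples=200, n_features=100, n_informative=5, n_redundant=0,
    class_sep=2.0, shuffle=False, random_state=0,
)
selected = consensus_features(X, y)
print("consensus feature indices:", selected)

# Evaluate a classifier restricted to the consensus features, if any survived.
if selected:
    scores = cross_val_score(SVC(kernel="linear"), X[:, selected], y, cv=5)
    print("mean cross-validated accuracy:", scores.mean())
```

Intersection is the strictest way to integrate the three rankings, which is one reason a multi-method approach can shrink the candidate list. For a real study, feature selection would need to be nested inside the cross-validation loop; selecting on the full data first, as this sketch does for brevity, inflates the reported accuracy.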

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5994/9447897/11e983ae0814/pone.0269126.g001.jpg
