Department of Orthopaedics and Traumatology, The Chinese University of Hong Kong, Hong Kong, People's Republic of China.
Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Guangdong, People's Republic of China.
BMC Bioinformatics. 2024 Aug 8;25(1):260. doi: 10.1186/s12859-024-05870-y.
Quantitative measurement of RNA expression levels through RNA-Seq is an ideal replacement for conventional cancer diagnosis via microscope examination. Currently, cancer-related RNA-Seq studies focus on two aspects: classifying the status and tissue of origin of a sample, and discovering marker genes. Existing studies typically identify marker genes by statistically comparing healthy and cancer samples. However, this approach overlooks marker genes with small differences in expression level and may be influenced by experimental results. This paper introduces "GENESO," a novel framework for pan-cancer classification and marker gene discovery that combines the occlusion method with deep learning. We first trained a baseline deep LSTM neural network capable of distinguishing the origins and statuses of samples from RNA-Seq data. We then propose a novel marker gene discovery method called "Symmetrical Occlusion (SO)". It works with the baseline LSTM network, mimicking the "gain of function" and "loss of function" of genes to quantitatively evaluate their importance in pan-cancer classification. After identifying the most important genes, we isolate them to train new neural networks, resulting in higher-performance LSTM models that use only a reduced set of highly relevant genes. The baseline neural network achieves a validation accuracy of 96.59% in pan-cancer classification. With the help of SO, the accuracy of the second network reaches 98.30% while using 67% fewer genes. Notably, our method excels at identifying marker genes that are not differentially expressed. Moreover, we assessed the feasibility of our method on single-cell RNA-Seq data, using known marker genes as a validation test.
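The abstract does not give the exact definition of Symmetrical Occlusion, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes that "loss of function" can be imitated by setting one gene's expression to a low value in every sample, that "gain of function" can be imitated by setting it to a high value, and that a gene's importance can be taken as the average change in the predicted probability of the true class across both occlusions. The function `symmetrical_occlusion_scores`, its `low` and `high` parameters, and the toy classifier `toy_predict_proba` (a stand-in for the trained LSTM) are all hypothetical names introduced for illustration, not the authors' implementation.

```python
import numpy as np


def symmetrical_occlusion_scores(predict_proba, X, y, low=None, high=None):
    """Score each gene by occluding it in both directions (illustrative only).

    predict_proba: callable mapping an (n_samples, n_genes) array to
                   (n_samples, n_classes) class probabilities.
    X: expression matrix, one row per sample, one column per gene.
    y: integer class label per sample.
    low / high: occlusion values (assumption: per-gene min and max by default).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_genes = X.shape
    low = X.min(axis=0) if low is None else np.broadcast_to(low, (n_genes,))
    high = X.max(axis=0) if high is None else np.broadcast_to(high, (n_genes,))

    rows = np.arange(n_samples)
    base = predict_proba(X)[rows, y]  # true-class probability, no occlusion

    scores = np.zeros(n_genes)
    for j in range(n_genes):
        changes = []
        # Occlude gene j twice: once low ("loss"), once high ("gain").
        for value in (low[j], high[j]):
            X_occ = X.copy()
            X_occ[:, j] = value
            p = predict_proba(X_occ)[rows, y]
            changes.append(np.mean(np.abs(base - p)))
        # Assumption: importance is the mean effect of the two occlusions.
        scores[j] = 0.5 * (changes[0] + changes[1])
    return scores


def toy_predict_proba(X):
    """Hypothetical stand-in for a trained classifier: only gene 0 matters."""
    logits = np.stack([-X[:, 0], X[:, 0]], axis=1)
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))          # 20 synthetic samples, 3 genes
y_demo = (X_demo[:, 0] > 0).astype(int)    # label depends on gene 0 only
scores_demo = symmetrical_occlusion_scores(toy_predict_proba, X_demo, y_demo)
```

In this toy setting, gene 0 receives the only non-zero score because the stand-in classifier ignores the other two genes. Ranking genes by such a score and retraining on the top-ranked subset would correspond to the gene-reduction step described in the abstract, although the actual scoring rule used by GENESO may differ.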