TabDEG: Classifying differentially expressed genes from RNA-seq data based on feature extraction and deep learning framework.

Affiliation

School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, Guangdong, China.

Publication information

PLoS One. 2024 Jul 22;19(7):e0305857. doi: 10.1371/journal.pone.0305857. eCollection 2024.

DOI:10.1371/journal.pone.0305857

PMID:39037985

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11262683/

Abstract

Traditional models for identifying differentially expressed genes (DEGs) have limitations on small-sample-size datasets: they require distribution assumptions to be met, and otherwise produce high false-positive and false-negative rates due to sample variation. In contrast, tabular data models based on deep learning (DL) frameworks do not need to consider the type of data distribution or sample variation. However, applying DL to RNA-Seq data remains a challenge because of the lack of proper labeling and the small sample size relative to the number of genes. Data augmentation (DA) extracts data features using different methods and procedures, and can substantially increase the number of complementary pseudo-values obtained from limited data without significant additional cost. Based on this, we combine DA with a DL framework-based tabular data model and propose TabDEG, a model that predicts DEGs and their up-regulation/down-regulation directions from gene expression data obtained from The Cancer Genome Atlas (TCGA) database. Compared with five counterpart methods, TabDEG achieves high sensitivity and low misclassification rates. Experiments show that TabDEG is robust and effective at enhancing data features to facilitate classification of high-dimensional, small-sample-size datasets, and confirm that TabDEG-predicted DEGs map to important Gene Ontology terms and pathways associated with cancer.
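To make the feature-extraction idea concrete, the following is a minimal sketch of how a small genes-by-samples expression matrix might be turned into a per-gene feature table that a tabular classifier could consume. The abstract does not list the features TabDEG actually uses, so the function name `gene_feature_table`, the chosen features (group means, variances, log fold change, standardized difference), and the simulated counts are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def gene_feature_table(tumor, normal, pseudocount=1.0):
    """Hypothetical helper: one feature row per gene from two genes x samples count matrices."""
    # Log-transform counts; the pseudocount avoids log2(0).
    t = np.log2(np.asarray(tumor, dtype=float) + pseudocount)
    n = np.log2(np.asarray(normal, dtype=float) + pseudocount)

    # Per-gene summary statistics within each group.
    mean_t, mean_n = t.mean(axis=1), n.mean(axis=1)
    var_t, var_n = t.var(axis=1, ddof=1), n.var(axis=1, ddof=1)

    # Log2 fold change: positive suggests up-regulation, negative down-regulation.
    log_fc = mean_t - mean_n

    # Welch-style standardized difference; the epsilon guards against zero variance.
    se = np.sqrt(var_t / t.shape[1] + var_n / n.shape[1]) + 1e-8

    return np.column_stack([mean_t, mean_n, var_t, var_n, log_fc, log_fc / se])

# Simulated toy data (an assumption, not TCGA data): 5 genes, 4 samples per group.
rng = np.random.default_rng(0)
normal = rng.poisson(50, size=(5, 4))
tumor = rng.poisson(50, size=(5, 4))
tumor[0] *= 8  # gene 0 is simulated as strongly up-regulated

features = gene_feature_table(tumor, normal)
print(features.shape)
```

Each row of `features` describes one gene, so the number of training examples scales with the number of genes rather than the number of samples, which is one plausible way a tabular model can cope with few samples and many genes.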
