Zhang Qiaosheng, Wei Yalong, Hou Jie, Li Hongpeng, Zhong Zhaoman
School of Computer Engineering, Jiangsu Ocean University, Lianyungang, 222005, China.
Public Teaching and Research Department, Huzhou College, Huzhou, 313000, China.
BMC Bioinformatics. 2024 Dec 27;25(1):392. doi: 10.1186/s12859-024-06013-z.
Cancer classification remains a challenging problem, mainly because of high-dimensional data and the difficulty of collecting patient samples. Obtaining patient samples is costly and resource-intensive, and the resulting classes are often imbalanced. Moreover, gene expression data are characterized by high dimensionality, small sample sizes and high noise, which can easily lead to problems such as the curse of dimensionality and overfitting. To address these difficulties, we incorporate prior pathway knowledge and combine an AutoEncoder with a Generative Adversarial Network (GAN).
In this study, we propose an effective and efficient deep learning method, named AEGAN, which combines an AutoEncoder and a GAN to generate synthetic samples of the minority class in imbalanced gene expression data. This data balancing technique is shown to be useful for cancer classification and to improve the performance of classifier models. Additionally, we integrate prior pathway knowledge and employ the Pathifier algorithm to calculate pathway scores for each sample. The resulting data augmentation approach, referred to as AEGAN-Pathifier, not only preserves the biological function of the data but also reduces its dimensionality. Validation with various classifiers shows an improvement in classifier performance.
AEGAN-Pathifier shows improved performance on the imbalanced datasets GSE25066, GSE20194, BRCA and Liver24. Results from various classifiers indicate that AEGAN-Pathifier has good generalization capability.
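The two ideas in the abstract, collapsing genes into pathway-level scores and topping up the minority class with synthetic samples, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The first principal component of each pathway's genes stands in for Pathifier, and Gaussian jitter around resampled minority rows stands in for the trained AEGAN generator. The function names, the toy data, the pathway gene sets and the order of the two steps are all assumptions made for illustration.

```python
import numpy as np


def pathway_scores(expr, pathways):
    """Collapse a (samples x genes) matrix into (samples x pathways).

    Stand-in for Pathifier (assumption): each score is the sample's
    projection onto the first principal component of the pathway's genes.
    """
    scores = np.empty((expr.shape[0], len(pathways)))
    for j, gene_idx in enumerate(pathways.values()):
        block = expr[:, gene_idx]
        centered = block - block.mean(axis=0)
        # Rows of vt are the principal directions of the pathway block.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        scores[:, j] = centered @ vt[0]
    return scores


def balance_minority(x, y, rng):
    """Append synthetic rows until every class matches the largest class.

    Stand-in generator (assumption): Gaussian jitter around resampled
    minority rows, where the paper would sample from the AEGAN generator.
    """
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    xs, ys = [x], [y]
    for c, n in zip(classes, counts):
        need = target - n
        if need == 0:
            continue
        rows = x[y == c]
        picks = rows[rng.integers(0, len(rows), size=need)]
        noise = rng.normal(0.0, 0.1 * rows.std(axis=0), size=picks.shape)
        xs.append(picks + noise)
        ys.append(np.full(need, c))
    return np.vstack(xs), np.concatenate(ys)


# Toy data: 30 samples, 12 genes, 24 majority vs 6 minority labels.
rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 12))
labels = np.array([0] * 24 + [1] * 6)
# Hypothetical pathway-to-gene-index mapping.
pathways = {
    "pathway_a": [0, 1, 2, 3],
    "pathway_b": [4, 5, 6, 7, 8],
    "pathway_c": [9, 10, 11],
}

scores = pathway_scores(expr, pathways)
x_bal, y_bal = balance_minority(scores, labels, rng)
print(scores.shape, x_bal.shape, np.bincount(y_bal))
```

The sketch only shows the data flow: 12 gene columns shrink to 3 pathway columns, and the minority class is padded to the size of the majority class before any classifier is trained. Replacing the jitter step with samples from a trained generator would leave the rest of the pipeline unchanged.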