College of Public Health, Zhengzhou University, Zhengzhou, 450001, China.
BMC Bioinformatics. 2022 May 12;23(1):175. doi: 10.1186/s12859-022-04689-9.
BACKGROUND: Lung cancer has one of the highest mortality rates of any cancer in China. With the rapid development of high-throughput sequencing and the growing application of deep learning, deep neural networks based on gene expression have become an active research direction in lung cancer diagnosis and offer an effective route to early diagnosis. Building such a model is therefore of great significance for the early diagnosis of lung cancer. However, the main challenges in mining gene expression datasets are the curse of dimensionality and imbalanced data. Some existing methods cannot address these problems, because the number of measured variables (genes) overwhelms the small number of samples, which results in poor performance in early lung cancer diagnosis.
METHOD: To address the small sample size, high dimensionality, and class imbalance of gene expression datasets, this paper proposes a gene selection method based on Kullback-Leibler (KL) divergence, which selects genes with higher KL divergence as model features. A deep neural network model is then built with Focal Loss as the loss function, and k-fold cross-validation with k = 5 is used to validate and select the best model.
RESULT: The proposed deep learning model with KL divergence-based gene selection achieves an AUC of 0.99 on the validation set, and the model shows high generalization performance.
CONCLUSION: The proposed deep neural network model based on KL divergence gene selection is shown to be an accurate and effective method for lung cancer prediction.
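The two core ingredients named in the METHOD section, ranking genes by KL divergence and training with Focal Loss, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the KL divergence is estimated or which Focal Loss hyperparameters are used, so the histogram-based estimate, the smoothing constant `eps`, the bin count, the defaults `gamma=2.0` and `alpha=0.25`, and the function names `kl_gene_scores`, `select_top_genes`, and `binary_focal_loss` are all assumptions made for illustration.

```python
import numpy as np


def kl_gene_scores(X, y, bins=10, eps=1e-6):
    """Score each gene (column of X) by KL(P || Q), where P and Q are
    histogram estimates of its expression in class 1 and class 0.
    Histogram estimation and eps smoothing are assumptions of this sketch."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        col = X[:, j]
        # Shared bin edges so the two class histograms are comparable.
        edges = np.histogram_bin_edges(col, bins=bins)
        p, _ = np.histogram(col[y == 1], bins=edges)
        q, _ = np.histogram(col[y == 0], bins=edges)
        # Smooth empty bins, then normalise to probability distributions.
        p = (p + eps) / (p + eps).sum()
        q = (q + eps) / (q + eps).sum()
        scores[j] = np.sum(p * np.log(p / q))
    return scores


def select_top_genes(X, y, k, bins=10):
    """Return the column indices of the k genes with the highest KL score."""
    scores = kl_gene_scores(X, y, bins=bins)
    return np.argsort(scores)[::-1][:k]


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).
    gamma and alpha are illustrative defaults, not values from the paper."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    # p_t is the predicted probability assigned to the true class.
    p_t = np.where(y_true == 1, p, 1.0 - p)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

In a 5-fold cross-validation setting, one would typically compute the gene ranking on the training folds only, so that the held-out fold does not influence feature selection; whether the paper does this is not stated in the abstract.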