IEEE J Biomed Health Inform. 2021 Aug;25(8):3219-3229. doi: 10.1109/JBHI.2021.3052008. Epub 2021 Aug 5.
The curse of dimensionality, caused by high dimensionality and low sample size, is a major challenge in gene expression data analysis. The real situation is even worse: labelling data is laborious and time-consuming, so only a small part of the limited samples will be labelled. Having so few labelled samples further increases the difficulty of training deep learning models. Interpretability is also an important requirement in biomedicine. Many existing deep learning methods attempt to provide interpretability, but they are rarely applied to gene expression data. Recent semi-supervised graph convolution network methods try to address these problems by smoothing the label information over a graph. However, to the best of our knowledge, these methods only utilize graphs in either the feature space or the sample space, which restricts their performance. We propose a transductive semi-supervised representation learning method, the hierarchical graph convolution network (HiGCN), to aggregate the information of gene expression data in both feature and sample spaces. HiGCN first utilizes external knowledge to construct a feature graph and a similarity kernel to construct a sample graph. Then, two spatial-based GCNs are used to aggregate information on these graphs. We validate the model's performance empirically on synthetic and real datasets. Compared with two recent models and three traditional models, HiGCN learns better representations of gene expression data, and these representations improve the performance of downstream tasks, especially when the model is trained on only a few labelled samples. Important features can be extracted from our model to provide reliable interpretability.
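The idea of aggregating in both spaces can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity kernel, the graph construction, or the GCN layers, so the RBF kernel, the k-nearest-neighbour sparsification, the symmetric normalization, and all function names below are assumptions chosen for illustration. The sketch also uses a single weight-free propagation step in place of the two learned spatial-based GCNs, and a toy chain graph in place of a feature graph built from external knowledge.

```python
import numpy as np


def normalize_adjacency(adj):
    """Add self-loops and apply symmetric normalization D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def rbf_knn_graph(x, k=3, gamma=None):
    """Sample graph (assumed construction): link each sample to its k most
    similar samples under an RBF kernel, then symmetrize."""
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed default bandwidth
    sim = np.exp(-gamma * sq_dist)
    np.fill_diagonal(sim, 0.0)  # exclude self-similarity from the ranking
    nearest = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros_like(sim)
    rows = np.repeat(np.arange(x.shape[0]), k)
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)


def aggregate_both_spaces(x, feature_adj, sample_adj):
    """Smooth a (samples x genes) matrix over the gene graph, then over the
    sample graph. No learned weights: this is propagation only."""
    s_feat = normalize_adjacency(feature_adj)  # (genes, genes)
    s_samp = normalize_adjacency(sample_adj)   # (samples, samples)
    return s_samp @ (x @ s_feat)


# Toy data: 6 samples, 4 genes.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))

# Hypothetical gene-gene edges standing in for external knowledge.
feature_adj = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)

sample_adj = rbf_knn_graph(x, k=2)
h = aggregate_both_spaces(x, feature_adj, sample_adj)
print(h.shape)  # (6, 4)
```

In a trained model, each propagation step would be followed by learned weights and a nonlinearity, and the resulting representation would feed a classifier fitted on the labelled samples only.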