Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.
Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA.
Int J Mol Sci. 2022 Sep 6;23(18):10220. doi: 10.3390/ijms231810220.
The use of high-throughput omics technologies is becoming increasingly popular in all facets of biomedical science. The mRNA sequencing (RNA-seq) method reports quantitative measures of tens of thousands of biological features, providing a more comprehensive molecular perspective on the cancer mechanisms under study than traditional approaches. Graph-based learning models have been proposed to learn important hidden representations from gene expression data and network structure in order to improve cancer outcome prediction, patient stratification, and cell clustering. However, these graph-based methods cannot rank the importance of a sample's different neighbors in downstream cancer subtype analyses. In this study, we introduce omicsGAT, a graph attention network (GAT) model that integrates graph-based learning with an attention mechanism for RNA-seq data analysis. The multi-head attention mechanism in omicsGAT captures information about a particular sample more effectively by assigning different attention coefficients to its neighbors. Comprehensive experiments on The Cancer Genome Atlas (TCGA) breast cancer and bladder cancer bulk RNA-seq data and on two single-cell RNA-seq datasets validate that (1) the proposed model can effectively integrate the neighborhood information of a sample and learn an embedding vector that improves disease phenotype prediction, cancer patient stratification, and cell clustering, and (2) the attention matrix generated from the multi-head attention coefficients provides more useful information than the sample correlation-based adjacency matrix. From these results, we conclude that, as measured by the attention coefficients, some neighbors play a more important role than others in the cancer subtype analysis of a particular sample.
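To make the idea of per-neighbor attention coefficients concrete, the following is a minimal sketch of multi-head graph attention over a sample graph. It follows the standard GAT formulation (a shared linear projection, a LeakyReLU-scored attention vector, and a softmax over each sample's neighbors) and is not taken from the omicsGAT code. The function names, the toy expression matrix, the adjacency matrix, and the random weights are all assumptions made for illustration, and the output nonlinearity and training loop are omitted.

```python
import math
import random


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, x):
    """Multiply a (d_out x d_in) matrix by a length-d_in vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gat_head(features, adjacency, W, a, slope=0.2):
    """One attention head (illustrative, standard GAT form).

    features:  n x d_in list of sample expression vectors
    adjacency: n x n 0/1 matrix; assumed to include self-loops
    W:         d_out x d_in projection matrix
    a:         attention vector of length 2 * d_out
    Returns (embeddings, attention) where attention[i][j] is the
    coefficient sample i assigns to neighbor j.
    """
    n = len(features)
    d_out = len(W)
    h = [matvec(W, x) for x in features]
    attention = [[0.0] * n for _ in range(n)]
    embeddings = []
    for i in range(n):
        neighbors = [j for j in range(n) if adjacency[i][j]]
        # Unnormalized score for each neighbor: LeakyReLU(a . [h_i || h_j])
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, h[i] + h[j])), slope)
            for j in neighbors
        ]
        # Numerically stable softmax restricted to the neighborhood of i
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for j, e in zip(neighbors, exps):
            attention[i][j] = e / total
        # Embedding of i: attention-weighted sum of projected neighbors
        embeddings.append(
            [sum(attention[i][j] * h[j][k] for j in neighbors) for k in range(d_out)]
        )
    return embeddings, attention


def multi_head(features, adjacency, heads):
    """Run several heads and concatenate their embeddings per sample."""
    outputs = [gat_head(features, adjacency, W, a) for W, a in heads]
    embeddings = [
        sum((emb[i] for emb, _ in outputs), []) for i in range(len(features))
    ]
    attentions = [att for _, att in outputs]
    return embeddings, attentions


# Toy data (hypothetical): 4 samples, 3 genes, correlation-style graph
features = [
    [1.0, 0.2, 0.0],
    [0.9, 0.1, 0.1],
    [0.0, 1.0, 0.8],
    [0.1, 0.9, 1.0],
]
adjacency = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]

rng = random.Random(0)


def random_head(d_in, d_out):
    """Untrained random weights, standing in for learned parameters."""
    W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
    a = [rng.uniform(-1, 1) for _ in range(2 * d_out)]
    return W, a


heads = [random_head(3, 2) for _ in range(2)]
embeddings, attentions = multi_head(features, adjacency, heads)
```

Each row of an attention matrix in `attentions` sums to 1 over that sample's neighbors, so the entries can be read as a ranking of neighbor importance. A binary adjacency matrix, by contrast, treats all neighbors of a sample equally.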