Wang Ruo Han, Ng Yen Kaow, Zhang Xianglilan, Wang Jianping, Li Shuai Cheng
Department of Computer Science, City University of Hong Kong Shenzhen Research Institute, Shenzhen, 518063, China.
Department of Computer Science, City University of Hong Kong, Hong Kong, 999077, China.
Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae188.
Genome sequencing technologies produce vast numbers of genomic sequences. Neural network-based methods can be prime candidates for extracting insights from these sequences because they apply to large and diverse datasets. However, the highly variable lengths of genome sequences make it difficult to represent them as input to a neural network. Genetic variations further complicate tasks that involve sequence comparison or alignment.
Inspired by the theory and applications of "spaced seeds," we propose a graph representation of genome sequences called the "gapped pattern graph." These graphs can be transformed by a graph convolutional network (GCN) into lower-dimensional embeddings for downstream tasks. Based on gapped pattern graphs, we implemented a neural network model and demonstrated its performance on diverse tasks involving microbial and mammalian genome data. Our method consistently outperformed the other state-of-the-art methods across various metrics on all tasks, especially for sequences with limited homology to the training data. In addition, our model was able to identify distinct gapped pattern signatures from the sequences.
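The abstract does not define the gapped pattern graph, so the following is only a minimal sketch under a stated assumption: nodes are k-mers, and a directed edge links two k-mers that occur in the sequence separated by a gap of d bases, weighted by relative frequency. The function names `gapped_pattern_graph` and `adjacency_matrix` and the parameters `k` and `d` are hypothetical and are not taken from the GCNFrame code. The sketch illustrates how such a graph gives sequences of any length a fixed-size representation.

```python
from collections import Counter
from itertools import product


def gapped_pattern_graph(seq, k=2, d=1):
    """Count pairs of k-mers separated by a gap of d bases (assumed definition).

    Returns a dict mapping (left_kmer, right_kmer) to its relative frequency.
    """
    seq = seq.upper()
    span = 2 * k + d  # total window length: k-mer, gap, k-mer
    counts = Counter()
    for i in range(len(seq) - span + 1):
        left = seq[i:i + k]
        right = seq[i + k + d:i + span]
        # Skip windows containing ambiguous bases such as N.
        if set(left + right) <= set("ACGT"):
            counts[(left, right)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {edge: c / total for edge, c in counts.items()}


def adjacency_matrix(edges, k=2):
    """Lay the weighted edges out as a 4^k x 4^k matrix over all k-mers."""
    nodes = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for (left, right), weight in edges.items():
        matrix[index[left]][index[right]] = weight
    return nodes, matrix


if __name__ == "__main__":
    short_edges = gapped_pattern_graph("ACGTACGT", k=2, d=1)
    long_edges = gapped_pattern_graph("ACGTTGCAAGCTNNACGTAGGCTA" * 10, k=2, d=1)
    _, short_mat = adjacency_matrix(short_edges, k=2)
    _, long_mat = adjacency_matrix(long_edges, k=2)
    # Both matrices have the same shape regardless of sequence length.
    print(len(short_mat), len(long_mat))
```

Under this assumption the matrix size depends only on k, not on sequence length, so the matrix could serve as the weighted adjacency input to a graph convolutional network. How the published model builds and consumes its graphs should be checked against the repository.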
The framework is available at https://github.com/deepomicslab/GCNFrame.