MMGAT: a graph attention network framework for ATAC-seq motifs finding.

Author information

Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China.

School of Artificial Intelligence, Jilin University, Changchun, 130012, China.

Publication information

BMC Bioinformatics. 2024 Apr 20;25(1):158. doi: 10.1186/s12859-024-05774-x.

DOI:10.1186/s12859-024-05774-x

PMID:38643066

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11031952/

Abstract

BACKGROUND

Motif finding in Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) data is essential for revealing the intricacies of transcription factor binding sites (TFBSs) and their pivotal roles in gene regulation. Deep learning technologies, including convolutional neural networks (CNNs) and graph neural networks (GNNs), have achieved success in finding ATAC-seq motifs. However, CNN-based methods are limited by the fixed width of the convolutional kernel, which makes it difficult to find multiple transcription factor binding sites of different lengths. GNN-based methods are limited because they use the edge weight information directly, which makes it difficult to aggregate the neighboring nodes' information efficiently when computing node embeddings.
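To make the graph setting concrete, the following is a minimal sketch of a sequence/k-mer graph in which edges carry fixed weights. It is an illustration only: the helper names `kmers` and `build_edges` are hypothetical, and using raw occurrence counts as edge weights is an assumption rather than the weighting scheme of any particular tool.

```python
from collections import Counter

def kmers(seq, k):
    """Return all overlapping k-mers of a DNA sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_edges(sequences, k):
    """Map (sequence index, k-mer) -> occurrence count.

    Assumption: the count is used directly as the edge weight between a
    sequence node and a k-mer node.
    """
    edges = {}
    for idx, seq in enumerate(sequences):
        for kmer, count in Counter(kmers(seq, k)).items():
            edges[(idx, kmer)] = count
    return edges

# Toy input, not real ATAC-seq data.
sequences = ["ACGTACGT", "TTACGTTT"]
edges = build_edges(sequences, k=4)
```

With fixed weights like these, every neighbor's contribution to a node embedding is set before training, which is the limitation the paragraph above describes.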

RESULTS

To address these limitations, we developed a novel graph attention network framework named MMGAT, which employs an attention mechanism to adjust the attention coefficients among different nodes. MMGAT then finds multiple ATAC-seq motifs based on the attention coefficients of sequence nodes and k-mer nodes as well as the coexisting probability of k-mers. Our approach achieved better performance on the human ATAC-seq datasets than existing tools, as evidenced by the highest scores on the precision, recall, F1_score, ACC, AUC, and PRC metrics, and it found 389 higher-quality motifs. To validate the performance of MMGAT in predicting TFBSs and finding motifs on more datasets, we enlarged the number of human ATAC-seq datasets to 180 and newly integrated 80 mouse ATAC-seq datasets for multi-species experimental validation. On the mouse ATAC-seq datasets, MMGAT also achieved the highest scores on all six metrics and found 356 higher-quality motifs. To make MMGAT easier for researchers to use, we have also developed a user-friendly web server named MMGAT-S that hosts the MMGAT method and ATAC-seq motif finding results.
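As a rough illustration of what attention coefficients among nodes can look like, the sketch below follows the standard graph attention formulation (a shared linear map, a LeakyReLU-scored concatenation, and a softmax over neighbors). It is not taken from the MMGAT code: the function `attention_coefficients`, the weight matrix `W`, the attention vector `a`, and all feature values are assumptions chosen for the example.

```python
import math

def leaky_relu(x, slope=0.2):
    """LeakyReLU activation applied to a single score."""
    return x if x > 0 else slope * x

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def attention_coefficients(h_center, h_neighbors, W, a):
    """Softmax-normalized attention of one node over its neighbors.

    score_j = LeakyReLU(a . [W h_center || W h_j]); alpha = softmax(score).
    """
    z_i = matvec(W, h_center)
    scores = []
    for h_j in h_neighbors:
        z_j = matvec(W, h_j)
        concat = z_i + z_j  # list concatenation, i.e. [z_i || z_j]
        scores.append(leaky_relu(sum(ai * ci for ai, ci in zip(a, concat))))
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical features: one sequence node attending over three k-mer nodes.
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 1.0, 0.25]
h_seq = [1.0, 2.0]
h_kmers = [[0.5, 0.5], [2.0, 0.0], [0.0, 1.0]]
alpha = attention_coefficients(h_seq, h_kmers, W, a)
```

Here `alpha` sums to 1 and the second k-mer node receives the largest coefficient. In a trained model, `W` and `a` would be learned, so the relative importance of each neighbor is adjusted during training instead of being fixed by a precomputed edge weight.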

CONCLUSIONS

The advanced methodology MMGAT provides a robust tool for finding ATAC-seq motifs, and the comprehensive server MMGAT-S makes a significant contribution to genomics research. The open-source code of MMGAT can be found at https://github.com/xiaotianr/MMGAT , and MMGAT-S is freely available at https://www.mmgraphws.com/MMGAT-S/ .
