School of Big Data and Software Engineering, Chongqing University, Chongqing, China.
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China.
Nat Comput Sci. 2024 Apr;4(4):285-298. doi: 10.1038/s43588-024-00622-7. Epub 2024 Apr 10.
The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) technology provides insight into gene regulation and epigenetic heterogeneity at single-cell resolution, but cell annotation from scATAC-seq remains challenging due to high dimensionality and extreme sparsity within the data. Existing cell annotation methods mostly focus on the cell-by-peak matrix without fully utilizing the underlying genomic sequence. Here we propose a method, SANGO, for accurate single-cell annotation by integrating genome sequences around the accessibility peaks within scATAC-seq data. The genome sequences of peaks are encoded into low-dimensional embeddings, and then iteratively used to reconstruct the peak statistics of cells through a fully connected network. The learned weights are considered as regulatory modes to represent cells, and utilized to align the query cells and the annotated cells in the reference data through a graph transformer network for cell annotations. SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer. Moreover, from the annotated cells, we found cell-type-specific peaks that provide functional insights and biological signals through expression enrichment analysis, cis-regulatory chromatin interaction analysis and motif enrichment analysis.
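The first stage described in the abstract (peak sequences encoded into low-dimensional embeddings, which are then used to reconstruct per-cell peak accessibility so that the learned weights act as cell representations) can be illustrated with a minimal NumPy sketch. This is not SANGO's implementation: a fixed random projection of one-hot sequences stands in for the learned sequence encoder, a single linear layer with a sigmoid stands in for the fully connected network, training is plain gradient descent on toy random data, and the graph transformer stage is omitted. All function names, sizes and hyperparameters are hypothetical.

```python
import numpy as np

def one_hot_encode(seq):
    """One-hot encode a DNA string into a (len, 4) array in A, C, G, T order."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        if base in index:  # unknown bases such as N stay all-zero
            out[i, index[base]] = 1.0
    return out

def encode_peaks(seqs, dim, rng):
    """Stand-in encoder (assumption): random linear projection of the
    flattened one-hot sequences to `dim` dimensions."""
    length = len(seqs[0])
    flat = np.stack([one_hot_encode(s).ravel() for s in seqs])
    proj = rng.standard_normal((length * 4, dim)) / np.sqrt(length)
    return flat @ proj  # shape: (n_peaks, dim)

def fit_cell_weights(peak_emb, accessibility, lr=0.3, epochs=200):
    """Fit one weight vector per cell so that sigmoid(peak_emb @ W + b)
    reconstructs the binary peak-by-cell accessibility matrix."""
    n_peaks, dim = peak_emb.shape
    n_cells = accessibility.shape[1]
    W = np.zeros((dim, n_cells))
    b = np.zeros(n_cells)
    eps = 1e-9
    losses = []
    for _ in range(epochs):
        prob = 1.0 / (1.0 + np.exp(-(peak_emb @ W + b)))
        # Mean binary cross-entropy over all peak-cell entries.
        loss = -np.mean(accessibility * np.log(prob + eps)
                        + (1 - accessibility) * np.log(1 - prob + eps))
        losses.append(loss)
        # Gradient of each cell's mean cross-entropy with respect to its logits.
        grad = (prob - accessibility) / n_peaks
        W -= lr * (peak_emb.T @ grad)
        b -= lr * grad.sum(axis=0)
    # Each row of W.T is the learned weight vector of one cell.
    return W.T, losses

# Toy data: 50 random peak sequences of length 20 and 12 cells (all hypothetical).
rng = np.random.default_rng(0)
peak_seqs = ["".join(rng.choice(list("ACGT"), size=20)) for _ in range(50)]
accessibility = (rng.random((50, 12)) < 0.1).astype(float)  # sparse binary matrix

peak_emb = encode_peaks(peak_seqs, dim=8, rng=rng)
cell_repr, losses = fit_cell_weights(peak_emb, accessibility)
```

In this sketch `cell_repr` holds one 8-dimensional vector per cell. According to the abstract, representations of this kind are what SANGO passes to a graph transformer to align query cells with annotated reference cells; that step is not reproduced here.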