annATAC: automatic cell type annotation for scATAC-seq data based on language model.

Author information

Cui Lingyu, Wang Fang, Li Hongfei, Liu Qiaoming, Zhou Murong, Wang Guohua

Affiliations

College of Life Science, Northeast Forestry University, Harbin, 150040, China.

The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, 324000, China.

Publication information

BMC Biol. 2025 May 28;23(1):145. doi: 10.1186/s12915-025-02244-5.

DOI:10.1186/s12915-025-02244-5

PMID:40437567

Abstract

BACKGROUND

Cell type annotation is the cornerstone of downstream analysis of single-cell data. However, scATAC-seq data are highly sparse and high-dimensional, which makes annotation challenging.

RESULTS

We introduce annATAC, a novel language-model-based method for the automatic annotation of cell types in scATAC-seq data. The method mainly consists of three stages. In the pre-training stage, the model is trained on a large amount of unlabeled data to learn the interactions between peaks, which gives it a preliminary understanding of the data's features. In the fine-tuning stage, a small amount of labeled data is used to train the model further so that it can identify cell types accurately. Finally, in the prediction stage, the trained model is applied to annotate scATAC-seq data.
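The three-stage workflow described above can be illustrated with a toy sketch. This is not annATAC's implementation: the abstract does not specify the model architecture or training objective, so the language model is replaced here by a deliberately simple stand-in (peak co-accessibility counts followed by a nearest-centroid classifier). All function names (`pretrain`, `embed`, `fine_tune`, `predict`) and the example data are hypothetical and serve only to show how unlabeled data, labeled data, and query cells enter the pipeline at different stages.

```python
# Toy stand-in for a pretrain / fine-tune / predict workflow on binary
# cell-by-peak data. NOT the annATAC model: co-accessibility counts and
# nearest-centroid classification replace the language model (assumption).

Cell = list[int]  # binary accessibility vector, one entry per peak


def pretrain(unlabeled: list[Cell]) -> list[list[float]]:
    """Stage 1: learn peak-peak relationships from unlabeled cells."""
    n_peaks = len(unlabeled[0])
    co = [[0.0] * n_peaks for _ in range(n_peaks)]
    for cell in unlabeled:
        open_peaks = [i for i, v in enumerate(cell) if v]
        for i in open_peaks:
            for j in open_peaks:
                co[i][j] += 1.0
    # Row-normalise; a peak never seen open keeps only itself.
    for i, row in enumerate(co):
        total = sum(row)
        co[i] = [v / total for v in row] if total else [
            1.0 if j == i else 0.0 for j in range(n_peaks)
        ]
    return co


def embed(cell: Cell, co: list[list[float]]) -> list[float]:
    """Smooth a sparse cell profile using the learned peak relationships."""
    n = len(cell)
    return [sum(cell[i] * co[i][j] for i in range(n)) for j in range(n)]


def fine_tune(labeled: list[tuple[Cell, str]], co: list[list[float]]) -> dict[str, list[float]]:
    """Stage 2: use a few labeled cells to build one centroid per cell type."""
    grouped: dict[str, list[list[float]]] = {}
    for cell, label in labeled:
        grouped.setdefault(label, []).append(embed(cell, co))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(cell: Cell, co: list[list[float]], centroids: dict[str, list[float]]) -> str:
    """Stage 3: annotate a new cell with the nearest cell-type centroid."""
    e = embed(cell, co)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(e, centroids[label])),
    )


# Made-up example data: 4 peaks, two hypothetical cell types.
unlabeled = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1],
             [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1]]
labeled = [([1, 1, 0, 0], "T cell"), ([0, 0, 1, 1], "B cell")]

co = pretrain(unlabeled)
centroids = fine_tune(labeled, co)
print(predict([0, 1, 0, 0], co, centroids))
print(predict([0, 0, 1, 0], co, centroids))
```

In this sketch the query cells each have a single open peak, yet they can still be assigned a type because the relationships learned from unlabeled data spread signal to co-accessible peaks. This mirrors, in a very simplified way, why pre-training on unlabeled data can help with sparse inputs.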

CONCLUSIONS

Compared with other automatic annotation methods across multiple datasets, annATAC achieves superior annotation performance. Further experiments show that annATAC has great potential for identifying marker peaks and marker motifs. annATAC is expected to enable deeper and more precise analysis in scATAC-seq research and thereby to advance related biomedical research.
