annATAC: automatic cell type annotation for scATAC-seq data based on language model.

Authors

Cui Lingyu, Wang Fang, Li Hongfei, Liu Qiaoming, Zhou Murong, Wang Guohua

Affiliations

College of Life Science, Northeast Forestry University, Harbin, 150040, China.

The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, 324000, China.

Publication information

BMC Biol. 2025 May 28;23(1):145. doi: 10.1186/s12915-025-02244-5.

Abstract

BACKGROUND

Cell type annotation is the cornerstone of downstream analysis of single-cell data. However, scATAC-seq data are highly sparse and high-dimensional, which makes annotation challenging.

RESULTS

We introduce annATAC, a novel method based on a language model for the automatic annotation of cell types in scATAC-seq data. The method consists of three stages. In the pre-training stage, the model is trained on a large amount of unlabeled data to learn the interaction relationships between peaks, building a preliminary understanding of the data features. In the fine-tuning stage, a small quantity of labeled data is used for a second round of training, which enables the model to identify cell types accurately. Finally, in the prediction stage, the trained model is applied to annotate scATAC-seq data.
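The three-stage data flow described above can be illustrated with a deliberately small sketch. This is not the annATAC model: the language model is replaced by simple peak co-accessibility statistics, and the function names (`pretrain`, `embed`, `fine_tune`, `predict`), the six-peak toy matrix, and the labels `typeA` and `typeB` are all hypothetical. The sketch only shows how unlabeled cells, a few labeled cells, and new cells enter the workflow at different stages.

```python
# Toy stand-in for a pre-train / fine-tune / predict workflow.
# NOT the annATAC model: co-accessibility counts replace the language model,
# and all names and data below are illustrative assumptions.

def pretrain(unlabeled_cells):
    """Stage 1: learn peak-peak co-accessibility from unlabeled cells.

    Returns W where W[i][j] is the fraction of cells with peak i open
    that also have peak j open.
    """
    n_peaks = len(unlabeled_cells[0])
    counts = [[0.0] * n_peaks for _ in range(n_peaks)]
    for cell in unlabeled_cells:
        open_peaks = [i for i, v in enumerate(cell) if v]
        for i in open_peaks:
            for j in open_peaks:
                counts[i][j] += 1.0
    weights = []
    for i in range(n_peaks):
        n_open = counts[i][i]  # number of cells in which peak i is open
        weights.append([c / n_open if n_open else 0.0 for c in counts[i]])
    return weights


def embed(weights, cell):
    """Represent a sparse cell as the mean co-accessibility row of its open peaks."""
    n_peaks = len(weights)
    open_peaks = [i for i, v in enumerate(cell) if v]
    if not open_peaks:
        return [0.0] * n_peaks
    return [sum(weights[i][j] for i in open_peaks) / len(open_peaks)
            for j in range(n_peaks)]


def fine_tune(weights, labeled_cells):
    """Stage 2: build one centroid per cell type from a few labeled cells."""
    grouped = {}
    for cell, label in labeled_cells:
        grouped.setdefault(label, []).append(embed(weights, cell))
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}


def predict(weights, centroids, cell):
    """Stage 3: assign the cell type whose centroid is nearest."""
    vec = embed(weights, cell)

    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))

    return min(centroids, key=sq_dist)


# Binary cell-by-peak matrix (rows: cells, columns: peaks), no labels.
unlabeled = [
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
]
# A small labeled set, one cell per type.
labeled = [([1, 1, 0, 0, 0, 0], "typeA"), ([0, 0, 0, 1, 0, 0], "typeB")]

weights = pretrain(unlabeled)
centroids = fine_tune(weights, labeled)
print(predict(weights, centroids, [0, 0, 1, 0, 0, 0]))  # typeA
print(predict(weights, centroids, [0, 0, 0, 0, 0, 1]))  # typeB
```

In this toy data the first query cell shares no open peak with the labeled `typeA` cell, so a comparison on the raw binary vectors would place it closer to the `typeB` example. The co-accessibility learned from unlabeled cells is what links peak 2 to peaks 0 and 1, which is a much simplified analogue of learning peak relationships before seeing any labels.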

CONCLUSIONS

Compared with other automatic annotation methods across multiple datasets, annATAC shows superior annotation performance. Further experiments indicate that annATAC holds great potential for identifying marker peaks and marker motifs. annATAC is expected to provide deeper and more precise analysis results for scATAC-seq research and thereby to promote progress in related biomedical research.
