annATAC: automatic cell type annotation for scATAC-seq data based on language model.

Authors

Cui Lingyu, Wang Fang, Li Hongfei, Liu Qiaoming, Zhou Murong, Wang Guohua

Affiliations

College of Life Science, Northeast Forestry University, Harbin, 150040, China.

The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, 324000, China.

Publication information

BMC Biol. 2025 May 28;23(1):145. doi: 10.1186/s12915-025-02244-5.

Abstract

BACKGROUND

Cell type annotation is the cornerstone of downstream analysis of single-cell data. However, scATAC-seq data are highly sparse and high-dimensional, which makes annotation considerably more difficult.

RESULTS

We introduce annATAC, a novel method based on a language model for the automatic annotation of cell types in scATAC-seq data. The method primarily consists of three stages. In the pre-training stage, the model is trained on a large amount of unlabeled data to learn the interactions between peaks, which gives it a preliminary understanding of the data's features. In the fine-tuning stage, a small amount of labeled data is used to train the model further so that it can identify cell types accurately. Finally, in the prediction stage, the trained model is applied to annotate scATAC-seq data.
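The three-stage workflow can be illustrated with a toy sketch. This is not the annATAC implementation and it does not use a language model: simple peak co-accessibility statistics stand in for pre-training, and a nearest-centroid classifier stands in for fine-tuning. All function names, cell type labels, and data below are hypothetical and chosen only to show how unlabeled cells, a few labeled cells, and new cells each enter the pipeline.

```python
from collections import defaultdict

def pretrain(unlabeled_cells, n_peaks):
    """Stage 1 (toy stand-in): learn peak-peak relationships without labels.

    Returns co, where co[i][j] is the fraction of cells with peak i open
    that also have peak j open.
    """
    open_count = [0] * n_peaks
    co = [[0.0] * n_peaks for _ in range(n_peaks)]
    for cell in unlabeled_cells:
        on = [i for i, v in enumerate(cell) if v]
        for i in on:
            open_count[i] += 1
            for j in on:
                co[i][j] += 1.0
    for i in range(n_peaks):
        if open_count[i]:
            co[i] = [c / open_count[i] for c in co[i]]
    return co

def embed(cell, co):
    """Smooth a sparse cell by averaging the learned rows of its open peaks."""
    on = [i for i, v in enumerate(cell) if v]
    if not on:
        return [0.0] * len(co)
    return [sum(co[i][j] for i in on) / len(on) for j in range(len(co))]

def fine_tune(labeled_cells, co):
    """Stage 2 (toy stand-in): build one centroid per cell type from few labels."""
    sums, counts = {}, defaultdict(int)
    for cell, label in labeled_cells:
        vec = embed(cell, co)
        prev = sums.get(label, [0.0] * len(vec))
        sums[label] = [p + x for p, x in zip(prev, vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}

def predict(cell, co, centroids):
    """Stage 3: assign the cell type whose centroid is closest."""
    vec = embed(cell, co)
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

# Hypothetical binary cell-by-peak data: 6 peaks, two hidden cell types.
unlabeled = [
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
]
labeled = [([1, 1, 1, 0, 0, 0], "type_A"), ([0, 0, 0, 1, 1, 1], "type_B")]

co = pretrain(unlabeled, n_peaks=6)
centroids = fine_tune(labeled, co)

# Very sparse query cells: only one open peak each.
print(predict([1, 0, 0, 0, 0, 0], co, centroids))
print(predict([0, 0, 0, 0, 0, 1], co, centroids))
```

In this sketch the unlabeled stage is what lets a cell with a single open peak be classified: the learned co-accessibility fills in the peaks that tend to open together, which is a much simplified analogue of learning peak interactions during pre-training.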

CONCLUSIONS

Compared with other automatic annotation methods across multiple datasets, annATAC achieves superior annotation performance. Further experiments show that annATAC holds great potential for identifying marker peaks and marker motifs. annATAC is expected to provide deeper and more precise analysis results for scATAC-seq research and thereby to advance related biomedical research.
