scRCA: A Siamese network-based pipeline for annotating cell types using noisy single-cell RNA-seq reference data.

Author information

Liu Yan, Li Chen, Shen Long-Chen, Yan He, Wei Guo, Gasser Robin B, Hu Xiaohua, Song Jiangning, Yu Dong-Jun

Affiliations

Department of Computer Science, Yangzhou University, Yangzhou, 225100, China.

Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, 3800, Australia.

Publication information

Comput Biol Med. 2025 May;190:110068. doi: 10.1016/j.compbiomed.2025.110068. Epub 2025 Mar 29.

DOI:10.1016/j.compbiomed.2025.110068

PMID:40158457

Abstract

Accurate cell type annotation is critical for single-cell RNA sequencing (scRNA-seq) data analysis, as it provides insight into tissue-specific cell heterogeneity and enables tracking of cell state transitions. Cell type annotation is usually conducted by comparative analysis against known data (i.e., a reference) that is presumed to represent cell types accurately. However, this assumption is often problematic, as factors such as human error in wet-lab experiments and methodological limitations can introduce annotation errors into the reference dataset. As current pipelines for single-cell transcriptomic analysis do not adequately address this challenge, there is a strong demand for a computational pipeline that achieves high-quality cell type annotation using reference datasets containing inherent errors (referred to as "noise" in this study). Here, we built a Siamese network-based pipeline, termed scRCA, to accurately annotate cell types based on noisy reference data. To help users evaluate the reliability of scRCA annotations, an interpreter was also developed to explore the factors underlying the model's predictions. Our experiments demonstrate that, across 14 datasets, scRCA outperformed other widely adopted reference-based methods for cell type annotation. Using an independent dataset from four multiple myeloma patients, we further illustrated that scRCA can distinguish cancerous cells based on gene expression levels and, through its interpretable module, identify genes closely associated with multiple myeloma, providing valuable information for subsequent clinical treatment. With these advancements, we anticipate that scRCA will serve as a practical reference-based approach for accurate cell type annotation.
