Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.
Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.
Bioinformatics. 2022 Oct 31;38(21):4885-4892. doi: 10.1093/bioinformatics/btac617.
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) has been widely used to decompose complex tissues into functionally distinct cell types. The first, and usually the most important, step of scRNA-seq data analysis is to annotate the cell labels accurately. In recent years, many supervised annotation methods have been developed and shown to be more convenient and accurate than unsupervised cell clustering. One challenge faced by all supervised annotation methods is the identification of novel cell types, defined as cell types that are absent from the training data and exist only in the testing data. Existing methods usually label cells based simply on correlation coefficients or confidence scores, which sometimes results in an excessive number of unlabeled cells.
RESULTS: We developed a straightforward yet effective method that combines an autoencoder with iterative feature selection to automatically identify novel cells from scRNA-seq data. Our method trains an autoencoder on the labeled training data and applies it to the testing data to obtain reconstruction errors. By iteratively selecting features that demonstrate a bimodal pattern and reclustering the cells using the selected features, our method can accurately identify novel cells that are not present in the training data. We further combined this approach with a support vector machine to provide a complete solution for annotating the full range of cell types. Extensive numerical experiments using five real scRNA-seq datasets demonstrated favorable performance of the proposed method over existing methods serving similar purposes.
AVAILABILITY AND IMPLEMENTATION: Our R software package CAMLU is publicly available through the Zenodo repository (https://doi.org/10.5281/zenodo.7054422) or the GitHub repository (https://github.com/ziyili20/CAMLU).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
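The novel-cell identification idea summarized in RESULTS can be illustrated with a minimal Python sketch. It is not the CAMLU implementation (CAMLU is an R package, and its API is not shown here). Several pieces are assumptions made for illustration only: a linear autoencoder obtained from PCA stands in for the trained autoencoder, a standardized mean difference between the two current clusters stands in for the bimodality criterion, and a one-dimensional 2-means step stands in for the reclustering. All function names and parameters (`fit_linear_autoencoder`, `n_latent`, `n_features`, and so on) are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(X_train, n_latent=5):
    """Assumption: a PCA-based linear autoencoder replaces the paper's autoencoder."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are principal directions; the top ones act as tied encoder/decoder weights.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_latent]

def reconstruction_errors(X, mean, components):
    """Squared reconstruction error for every cell (row) and gene (column)."""
    Z = (X - mean) @ components.T        # encode
    X_hat = Z @ components + mean        # decode
    return (X - X_hat) ** 2

def two_means_1d(scores, n_iter=100):
    """1-D 2-means initialized at min and max; label 1 marks the high-score cluster."""
    lo, hi = scores.min(), scores.max()
    labels = np.zeros(scores.size, dtype=int)
    for _ in range(n_iter):
        labels = (np.abs(scores - hi) < np.abs(scores - lo)).astype(int)
        if labels.min() == labels.max():  # degenerate case: a single cluster
            break
        new_lo, new_hi = scores[labels == 0].mean(), scores[labels == 1].mean()
        if new_lo == lo and new_hi == hi:  # centers stopped moving
            break
        lo, hi = new_lo, new_hi
    return labels

def identify_novel_cells(X_train, X_test, n_latent=5, n_features=20, max_iter=10):
    """Return a boolean mask over test cells; True marks a putative novel cell."""
    mean, comps = fit_linear_autoencoder(X_train, n_latent)
    err = reconstruction_errors(X_test, mean, comps)
    # Initial split: cells with a high average reconstruction error are candidates.
    labels = two_means_1d(err.mean(axis=1))
    for _ in range(max_iter):
        if labels.min() == labels.max():
            break
        hi_err, lo_err = err[labels == 1], err[labels == 0]
        # Assumed bimodality proxy: genes whose errors differ most between the clusters.
        separation = np.abs(hi_err.mean(axis=0) - lo_err.mean(axis=0)) / (
            hi_err.std(axis=0) + lo_err.std(axis=0) + 1e-8
        )
        selected = np.argsort(separation)[-n_features:]
        # Recluster the cells using only the selected genes.
        new_labels = two_means_1d(err[:, selected].mean(axis=1))
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels.astype(bool)
```

Calling `identify_novel_cells(X_train, X_test)` with cells-by-genes matrices returns one flag per test cell. Cells that are not flagged would then be passed to an ordinary supervised classifier (a support vector machine in the paper) trained on the known cell types.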