Theunissen Lauren, Mortier Thomas, Saeys Yvan, Waegeman Willem
Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research and VIB Center for AI and Computational Biology (VIB.AI), 9000 Ghent, Belgium.
Department of Data-analysis and Mathematical Modeling, Ghent University Faculty of Bioscience Engineering, 9000 Ghent, Belgium.
Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf239.
Automatic cell-type annotation methods assign cell-type labels to new, unlabeled datasets by leveraging relationships learned from a reference RNA-seq atlas. However, new datasets may contain cell types absent from the reference dataset or exhibit feature distributions that diverge from it. Both scenarios can substantially reduce the reliability of cell-type predictions, a factor often overlooked in current automatic annotation methods. The field of out-of-distribution (OOD) detection, so far focused primarily on computer vision, addresses the identification of instances that differ from the training distribution. Applying OOD detection methods to novel cell-type annotation and data shift detection in single-cell transcriptomics may therefore enhance annotation accuracy and trustworthiness. We evaluate six OOD detection methods (LogitNorm, MC dropout, Deep Ensembles, Energy-based OOD, Deep NN, and Posterior networks) for their annotation and OOD detection performance in both synthetic and real-life application settings. We show that OOD detection methods can accurately identify novel cell types and demonstrate potential to detect significant data shifts in non-integrated datasets. Moreover, we find that integration of the OOD datasets does not interfere with OOD detection of novel cell types.
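To make the idea of OOD scoring concrete, the following is a minimal sketch of an energy-style score as it is commonly defined in the OOD literature, E(x) = -T * logsumexp(f(x) / T), computed on classifier logits. It is not the authors' implementation: the function names, the logit values, and the threshold are all hypothetical and chosen only for illustration. Cells whose energy exceeds the threshold are flagged as possible novel cell types.

```python
import numpy as np


def energy(logits, temperature=1.0):
    """Energy score -T * logsumexp(logits / T); higher values suggest OOD."""
    z = np.asarray(logits, dtype=float) / temperature
    # Subtract the row-wise maximum for a numerically stable log-sum-exp.
    m = z.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse


def flag_ood(logits, threshold, temperature=1.0):
    """Flag cells whose energy exceeds a (hypothetical) threshold."""
    return energy(logits, temperature) > threshold


# Hypothetical logits for three cells over four reference cell types.
logits = np.array([
    [9.0, 0.5, 0.2, 0.1],  # one dominant class: low energy
    [0.3, 0.2, 0.1, 0.2],  # flat, small logits: high energy
    [0.1, 7.5, 0.4, 0.3],  # one dominant class: low energy
])

scores = energy(logits)
is_ood = flag_ood(logits, threshold=-5.0)  # threshold is an assumption
```

In practice the threshold would be chosen on held-out reference data, for example so that a fixed fraction of known cell types is retained; the value used here is arbitrary.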