Wang Zihao, Wu Zeyu, Deng Minghua
Biomedical Interdisciplinary Research Center, Peking University, Yiheyuan Road, Beijing, 100871, China.
School of Mathematical Sciences, Peking University, Yiheyuan Road, Beijing, 100871, China.
BMC Bioinformatics. 2025 Aug 4;26(1):206. doi: 10.1186/s12859-025-06239-5.
As single-cell sequencing technology became widely used, researchers found that single-modality data alone cannot fully meet the needs of research on complex biological systems. To address this issue, they began to collect multimodal single-cell omics data simultaneously. However, different sequencing technologies often produce datasets in which one or more modalities are missing, so mosaic datasets are common in practice. The high dimensionality and sparsity of the data make integration difficult, and batch effects pose an additional challenge. To address these challenges, we propose scGCM, a flexible integration framework based on a variational autoencoder. The main task of scGCM is to integrate single-cell multimodal mosaic data and eliminate batch effects. We evaluated the method on multiple datasets encompassing different modalities of single-cell data. The results demonstrate that, compared with state-of-the-art multimodal data integration methods, scGCM offers significant advantages in clustering accuracy and data consistency. The source code of scGCM can be accessed at https://github.com/closmouz/scCGM.