Jiangsu Key Lab of Big Data Security & Intelligent Processing, School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, China.
Comput Biol Med. 2024 Mar;171:108225. doi: 10.1016/j.compbiomed.2024.108225. Epub 2024 Feb 27.
Single-cell RNA sequencing (scRNA-seq) is a powerful tool for exploring cellular heterogeneity, discovering novel or rare cell types, distinguishing tissue-specific cellular composition, and understanding cell differentiation during development. However, owing to technological limitations, dropout events in scRNA-seq can erroneously record some truly nonzero expression values as zero, which amounts to introducing noise into the gene expression matrix. This contamination degrades downstream analyses such as clustering, cell annotation, and differential gene expression analysis. It is therefore crucial to determine accurately which zeros are caused by dropout events and to impute them.
Because different zeros in the gene expression matrix warrant different levels of confidence, this paper proposes SinCWIm, an imputation method for dropout events in scRNA-seq based on weighted alternating least squares (WALS). The method uses the Pearson correlation coefficient and hierarchical clustering to quantify the confidence of each zero entry, then combines these confidences with WALS for matrix factorization. Finally, outlier removal and data correction bring the imputed values closer to the actual values.
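As a rough illustration of the WALS step, the following minimal sketch factorizes a cells-by-genes matrix with a per-entry weight matrix. It is not the SinCWIm implementation: the function names (`wals`, `weighted_loss`, `impute`), the rank, the ridge penalty `lam`, and the constant low weight given to zeros are all assumptions made for the example. In particular, the constant weight merely stands in for the confidence values that the method derives from Pearson correlation and hierarchical clustering, and the final clipping is only a placeholder for its outlier removal and data correction.

```python
import numpy as np

def wals(X, W, rank=2, lam=0.1, n_iter=30, seed=0):
    """Weighted ALS: minimize sum W_ij * (X_ij - u_i . v_j)^2 + lam * (|U|^2 + |V|^2)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(m, rank))
    reg = lam * np.eye(rank)
    for _ in range(n_iter):
        # Fix V and solve a small weighted ridge regression for each row factor.
        for i in range(n):
            Vw = V * W[i][:, None]            # rows of V scaled by the weights of row i
            U[i] = np.linalg.solve(Vw.T @ V + reg, Vw.T @ X[i])
        # Fix U and do the same for each column factor.
        for j in range(m):
            Uw = U * W[:, j][:, None]
            V[j] = np.linalg.solve(Uw.T @ U + reg, Uw.T @ X[:, j])
    return U, V

def weighted_loss(X, W, U, V, lam):
    """Regularized weighted squared error that the updates above decrease."""
    resid = X - U @ V.T
    return float((W * resid**2).sum() + lam * ((U**2).sum() + (V**2).sum()))

def impute(X, U, V):
    """Fill only the zero entries with the non-negative low-rank estimate."""
    estimate = np.clip(U @ V.T, 0.0, None)
    return np.where(X > 0, X, estimate)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    truth = rng.random((12, 2)) @ rng.random((2, 8))   # synthetic low-rank "expression"
    X = np.where(rng.random(truth.shape) < 0.3, 0.0, truth)  # simulated dropout
    W = np.where(X > 0, 1.0, 0.05)   # assumed weights: trust observed values, distrust zeros
    U, V = wals(X, W)
    print(impute(X, U, V).round(2))
```

Giving zeros a small weight rather than excluding them lets the factorization treat them as weak evidence of low expression, which is the general idea behind assigning each zero its own confidence.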
Comparative experiments on eight single-cell sequencing datasets demonstrate the overall superiority of SinCWIm over state-of-the-art models. When the imputed data were clustered and evaluated with the adjusted Rand index, the Usoskin, Pollen, and Bladder datasets scored 94.46%, 96.48%, and 76.74%, respectively. SinCWIm also yielded significant improvements in the retention of differentially expressed genes and in visualization.
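For readers unfamiliar with the metric, the adjusted Rand index compares a clustering with reference labels and corrects for chance agreement; the percentages above correspond to the index multiplied by 100. The sketch below is a small standard-library implementation of the usual pair-counting formula. The function name and the toy labels are illustrative assumptions and are not taken from the paper's evaluation code.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0
    # Number of item pairs that share a cluster in both labelings, in each alone.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)   # agreement expected by chance
    maximum = (together_true + together_pred) / 2
    if maximum == expected:          # degenerate case, e.g. every item in one cluster
        return 1.0
    return (together_both - expected) / (maximum - expected)

if __name__ == "__main__":
    reference = ["a", "a", "b", "b"]
    clustered = [0, 0, 1, 2]
    print(f"{100 * adjusted_rand_index(reference, clustered):.2f}%")
```

The index equals 1 for identical partitions regardless of how the clusters are named, and it can be negative when agreement is worse than chance.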
SinCWIm provides a valuable imputation method for handling dropout events in single-cell sequencing data. Compared with advanced methods, SinCWIm performs well in clustering, visualization, and other respects, and it is applicable to a variety of single-cell sequencing datasets.