Zhang Hongyu, Li Weining, Guan Jinting
Department of Automation, Xiamen University, Xiang'an South Road, Xiang'an District, Xiamen, Fujian 361102, China.
Key Laboratory of System Control and Information Processing, Ministry of Education, Dongchuan Road, Minhang District, Shanghai, Shanghai 200240, China.
Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf298.
Single-cell RNA-seq helps characterize cell types and states and reveals cellular heterogeneity in developmental processes and disease mechanisms. However, dropout events in single-cell RNA-seq data, in which genes are not detected because of technical noise or limited sequencing depth, seriously affect downstream analyses. Imputation is an effective way to mitigate the impact of dropout events. Current methods, however, may introduce new noise or alter high expression values during imputation, and their performance may fall short of expectations when the dropout rate is high, when the data types differ, or when the downstream analyses vary. We propose scTsI, a two-stage imputation algorithm for single-cell RNA-seq data. In the first stage, scTsI imputes zero values using information from neighboring cells and genes. In the second stage, scTsI transforms the expression matrix into a vector, performs a row transformation, and adjusts the imputed values through ridge regression, using bulk RNA-seq data as a constraint. scTsI keeps the original highly expressed values unchanged, avoids introducing new noise, and accepts sparse matrix input to accelerate imputation. We conduct experiments on a variety of simulated and real datasets with different dropout rates and compare scTsI with commonly used imputation methods. The results show that scTsI can restore gene expression and maintain cell-cell similarity across different data dimensions and dropout rates. scTsI can also improve the performance of data visualization, clustering, and cell trajectory inference.
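To make the two-stage idea concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the scTsI implementation: the abstract does not give the exact neighbor definition, the vectorization and row transformation, or the form of the ridge problem, so everything here is an assumption. The function names `impute_zeros_from_neighbors` and `ridge_adjust_to_bulk`, the parameters `k` and `lam`, the Euclidean distance on log-transformed counts, and the per-gene scalar ridge objective are all hypothetical stand-ins. What the sketch does share with the description above is that only zero entries are filled in and observed non-zero values are never modified.

```python
import numpy as np


def impute_zeros_from_neighbors(X, k=3):
    """Stage-1 stand-in (assumption, not the scTsI rule): fill each zero
    entry with the mean of that gene over the k nearest cells.

    X is a cells x genes matrix of non-negative expression values.
    """
    X = np.asarray(X, dtype=float)
    L = np.log1p(X)
    # Pairwise squared Euclidean distances between cells.
    sq = (L ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (L @ L.T)
    np.fill_diagonal(D, np.inf)          # a cell is not its own neighbor
    nbrs = np.argsort(D, axis=1)[:, :k]  # indices of the k nearest cells
    neighbor_mean = X[nbrs].mean(axis=1)  # shape: cells x genes
    out = X.copy()
    zero_mask = X == 0
    out[zero_mask] = neighbor_mean[zero_mask]  # observed values stay as-is
    return out


def ridge_adjust_to_bulk(X, X_imp, bulk_mean, lam=1.0):
    """Stage-2 stand-in (assumption): per gene, rescale only the imputed
    entries by a scalar a that minimizes

        (n * bulk_mean - sum_observed - a * sum_imputed)^2 + lam * (a - 1)^2

    i.e. a ridge-style penalty that shrinks a toward 1 while pulling the
    gene total toward the bulk-derived target.
    """
    X = np.asarray(X, dtype=float)
    X_imp = np.asarray(X_imp, dtype=float)
    bulk_mean = np.asarray(bulk_mean, dtype=float)
    zero_mask = X == 0
    n_cells = X.shape[0]
    S = np.where(zero_mask, X_imp, 0.0).sum(axis=0)  # imputed mass per gene
    t = n_cells * bulk_mean - X.sum(axis=0)          # mass the bulk target leaves for imputed entries
    a = (t * S + lam) / (S ** 2 + lam)               # closed-form minimizer
    a = np.clip(a, 0.0, None)                        # keep imputed values non-negative
    out = X_imp.copy()
    out[zero_mask] = (X_imp * a[None, :])[zero_mask]
    return out


# Toy usage: 3 cells x 2 genes with a single dropout at position (0, 1).
X_toy = np.array([[1.0, 0.0],
                  [1.0, 2.0],
                  [1.0, 4.0]])
X_stage1 = impute_zeros_from_neighbors(X_toy, k=2)
X_stage2 = ridge_adjust_to_bulk(X_toy, X_stage1, bulk_mean=[1.0, 3.0], lam=1.0)
```

In this toy setup the scalar `a` stays at 1 whenever the stage-1 values already reproduce the bulk mean, and a larger `lam` makes the adjustment more conservative. A sparse-matrix version would replace the dense distance computation, which is omitted here for brevity.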