PbImpute: Precise Zero Discrimination and Balanced Imputation in Single-Cell RNA Sequencing Data.

Authors

Zhang Yi, Wang Yin, Liu Xinyuan, Feng Xi

Affiliations

School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China.

Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China.

Publication information

J Chem Inf Model. 2025 Mar 10;65(5):2670-2684. doi: 10.1021/acs.jcim.4c02125. Epub 2025 Feb 17.

DOI:10.1021/acs.jcim.4c02125

PMID:39957720

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11898086/

Abstract

Single-cell RNA sequencing (scRNA-seq) has emerged as a transformative technology for elucidating cellular heterogeneity at unprecedented resolution. However, technical limitations such as limited sequencing depth and mRNA capture efficiency often result in zero counts, commonly referred to as "dropout zeros" in scRNA-seq data. These zeros pose significant challenges to downstream analysis, as they can distort the interpretation of cellular transcriptomes. While numerous computational methods have been developed to address this challenge, existing approaches frequently either impute too few zeros (under-imputation) or modify too many (over-imputation). Here, we propose a precisely balanced imputation (PbImpute) method designed to balance dropout recovery against biological zero preservation in scRNA-seq data. PbImpute employs a multistage approach: (1) initial discrimination between technical dropouts and biological zeros through parameter optimization of a new zero-inflated negative binomial (ZINB) distribution model, followed by initial imputation; (2) application of a purpose-built static repair algorithm to enhance data fidelity; (3) secondary dropout identification based on gene expression frequency and partition-specific coefficient of variation; (4) graph-embedding neural network-based imputation; and (5) a purpose-built dynamic repair mechanism to mitigate over-imputation effects. PbImpute distinguishes itself by integrating ZINB modeling with static and dynamic repair. This combination balances over- and under-imputation while preserving true biological zeros and reducing signal distortion.
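As a rough illustration of stage (1), the sketch below computes the posterior probability that an observed zero arose from the zero-inflation component of a standard ZINB model. It is not PbImpute's implementation. The function names, the mean/dispersion parameterization (`mu`, `theta`), the 0.5 threshold, and the convention that the zero-inflation component stands for technical dropout are all assumptions made for illustration; the abstract refers to a new ZINB formulation whose details are not given here.

```python
def nb_zero_prob(mu: float, theta: float) -> float:
    """P(X = 0) under a negative binomial with mean mu and dispersion theta."""
    if mu < 0 or theta <= 0:
        raise ValueError("mu must be >= 0 and theta must be > 0")
    return (theta / (theta + mu)) ** theta


def zinb_zero_posterior(pi: float, mu: float, theta: float) -> float:
    """Posterior probability that an observed zero came from the
    zero-inflation component (assumed here to model technical dropout)."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    if pi == 0.0:
        return 0.0
    nb0 = nb_zero_prob(mu, theta)
    # Bayes' rule over the two sources of a zero: inflation vs. NB sampling.
    return pi / (pi + (1.0 - pi) * nb0)


def classify_zero(pi: float, mu: float, theta: float,
                  threshold: float = 0.5) -> str:
    """Label a zero using a hypothetical posterior threshold."""
    posterior = zinb_zero_posterior(pi, mu, theta)
    return "likely_dropout" if posterior >= threshold else "likely_biological"


if __name__ == "__main__":
    # A zero in a gene with a high expected count vs. a low expected count.
    print(classify_zero(pi=0.3, mu=100.0, theta=1.0))
    print(classify_zero(pi=0.1, mu=0.1, theta=1.0))
```

Under these assumptions, a zero in a gene with a high expected count is flagged as a likely dropout, because the negative binomial component rarely produces a zero there, whereas a zero in a weakly expressed gene is more plausibly a genuine absence of expression.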
Comprehensive evaluation using both simulated and real scRNA-seq data sets demonstrated that PbImpute achieves superior performance (F1 Score = 0.88 at 83% dropout rate, ARI = 0.78 on PBMC) in discriminating between technical dropouts and biological zeros compared to state-of-the-art methods. The method significantly improves gene-gene and cell-cell correlation structures, enhances differential expression analysis sensitivity, optimizes clustering resolution and dimensional reduction visualization, and facilitates more accurate trajectory inference. Ablation studies confirmed the essential contribution of both the imputation and repair modules to the method's performance. The code is available at https://github.com/WyBioTeam/PbImpute. By enhancing the accuracy of scRNA-seq data imputation, PbImpute can improve the identification of cell subpopulations and the detection of differentially expressed genes, thereby facilitating more precise analyses of cellular heterogeneity and advancing disease research.
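The F1 score reported above can be computed on simulated data, where the origin of every zero is known. The sketch below treats technical dropout as the positive class. The function name and the boolean-sequence input format are hypothetical and are not taken from the PbImpute code base.

```python
from typing import Sequence


def dropout_f1(true_is_dropout: Sequence[bool],
               pred_is_dropout: Sequence[bool]) -> float:
    """F1 score for dropout-vs-biological-zero discrimination.

    Each position corresponds to one zero entry of the count matrix;
    True means "technical dropout" (the positive class in this sketch).
    """
    if len(true_is_dropout) != len(pred_is_dropout):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(true_is_dropout, pred_is_dropout))
    tp = sum(1 for t, p in pairs if t and p)        # dropouts found
    fp = sum(1 for t, p in pairs if not t and p)    # biological zeros over-imputed
    fn = sum(1 for t, p in pairs if t and not p)    # dropouts missed (under-imputed)

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = [True, True, False, False, True]
    predicted = [True, False, False, True, True]
    print(f"{dropout_f1(truth, predicted):.3f}")  # prints 0.667
```

In this framing, false positives correspond to over-imputation (biological zeros that would be overwritten) and false negatives to under-imputation (dropouts left as zeros), so a single F1 value summarizes the balance the method aims for.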
