MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, 210023, China.
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China; College of Computer Science and Technology, Nanjing Forestry University, Nanjing, 210037, China.
Comput Biol Med. 2022 Jul;146:105578. doi: 10.1016/j.compbiomed.2022.105578. Epub 2022 May 6.
Single-cell RNA sequencing (scRNA-seq) can reveal differences in genetic material at the single-cell level and is widely used in biomedical studies. However, the minute RNA content of individual cells often results in a high number of dropouts and introduces random noise into scRNA-seq data, concealing the original gene expression pattern. Data normalization is therefore a critical step in the analysis pipeline for adjusting for unexpected biological and technical effects, and the resulting semi-continuous normalized data exhibit a characteristic bimodal expression pattern. We further find that the positive continuous expression follows a right-skewed distribution, which remains under-explored by mainstream dimensionality reduction and imputation methods. We introduce a deep autoencoder network based on a two-part-gamma model (DAE-TPGM) for joint dimensionality reduction and imputation of scRNA-seq data. DAE-TPGM uses a two-part-gamma model to capture the statistical characteristics of semi-continuous normalized data, and uses a deep autoencoder to adaptively explore potential relationships between genes in support of imputation. As in classic applications of autoencoders to dimensionality reduction, our customized autoencoder better captures phenotypic information in peripheral blood mononuclear cells (PBMC) and clearly infers continuous phenotype information for hematopoiesis in mice. Compared with mainstream imputation methods such as MAGIC, SAVER, scImpute and DCA, the new model achieves substantial improvement in recognizing cellular phenotypes in two real datasets, and comprehensive analyses on synthetic "ground truth" data demonstrate that it holds a competitive advantage over other imputation methods in discovering underlying gene expression patterns in time-course data.
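To make the two-part-gamma idea concrete, the following is a minimal sketch of the negative log-likelihood such a model could assign to semi-continuous values: a Bernoulli part decides whether a value is zero or positive, and a gamma density covers the right-skewed positive part. The abstract does not give the exact parameterization or loss used in DAE-TPGM, so the function name `two_part_gamma_nll` and the parameters `pi`, `shape` and `scale` are assumptions made for illustration only. In the published method the parameters would presumably be produced per gene and per cell by the autoencoder rather than passed in as scalars.

```python
import math


def two_part_gamma_nll(values, pi, shape, scale):
    """Mean negative log-likelihood under a generic two-part gamma model.

    Hypothetical parameterization (not taken from the paper):
      pi    -- probability that a value is positive, 0 < pi < 1
      shape -- gamma shape parameter, > 0
      scale -- gamma scale parameter, > 0
    A value of exactly 0 has probability 1 - pi; a positive value x has
    density pi * Gamma(x; shape, scale).
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 < pi < 1.0 or shape <= 0.0 or scale <= 0.0:
        raise ValueError("need 0 < pi < 1, shape > 0, scale > 0")

    total = 0.0
    for x in values:
        if x < 0:
            raise ValueError("values must be non-negative")
        if x == 0:
            # Zero part: the value was not expressed (or dropped out).
            total -= math.log(1.0 - pi)
        else:
            # Positive part: log of the gamma density at x.
            log_pdf = (
                (shape - 1.0) * math.log(x)
                - x / scale
                - math.lgamma(shape)
                - shape * math.log(scale)
            )
            total -= math.log(pi) + log_pdf
    return total / len(values)


# Toy normalized expression values for one gene across six cells.
expression = [0.0, 0.0, 0.4, 1.1, 2.5, 0.0]
print(two_part_gamma_nll(expression, pi=0.5, shape=2.0, scale=0.7))
```

Minimizing a loss of this form over `pi`, `shape` and `scale` fits both the proportion of zeros and the skew of the positive values, which is the property the abstract attributes to the two-part-gamma model.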