Zhao Shuchang, Zhang Li, Liu Xuejun
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106 China.
Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, 210023 China.
Front Comput Sci (Berl). 2023;17(3):173902. doi: 10.1007/s11704-022-2011-y. Epub 2022 Oct 26.
Single-cell RNA sequencing (scRNA-seq) technology has become an effective tool for high-throughput transcriptomic studies. It circumvents the averaging artifacts of bulk RNA-seq technology, yielding new perspectives on the cellular diversity of seemingly homogeneous populations. Although various sequencing techniques have reduced the amplification bias and improved the low capture efficiency caused by the small amount of starting material, technical noise and biological variation are inevitably introduced during the experimental process, resulting in frequent dropout events that greatly hinder downstream analysis. Considering the bimodal expression pattern and the right skew present in normalized scRNA-seq data, we propose a customized autoencoder based on a two-part generalized gamma distribution (AE-TPGG) for scRNA-seq data analysis. It accounts for the mixed discrete-continuous nature of scRNA-seq data with a two-part model and uses the generalized gamma (GG) distribution to fit the positive, right-skewed continuous data. The autoencoder enables AE-TPGG to capture the inherent relationships between genes. In addition to producing a low-dimensional representation, the AE-TPGG model provides denoised imputation according to the statistical characteristics of gene expression. Results on real datasets demonstrate that our proposed model is competitive with current imputation methods and improves a diverse set of typical scRNA-seq data analyses.
Supplementary material is available in the online version of this article at 10.1007/s11704-022-2011-y.
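As an illustration of the two-part idea described in the abstract, the following is a minimal sketch of a two-part generalized gamma likelihood: a point mass at zero plus a GG density on the positive values. It is not the authors' implementation. The function names, the parameter names (pi_zero, scale, d, p), and the use of the Stacy parameterization of the GG distribution are all assumptions made for this example, and the abstract does not say how the autoencoder produces these parameters or which statistic is used for imputation. The mean shown here is only one natural candidate for a denoised value.

```python
import math


def gg_logpdf(x, scale, d, p):
    """Log-density of the generalized gamma distribution (Stacy form), x > 0.

    f(x) = p / (scale**d * Gamma(d / p)) * x**(d - 1) * exp(-(x / scale)**p)
    Parameter names are assumptions for this sketch, not the paper's notation.
    """
    if x <= 0:
        raise ValueError("the GG density is defined only for x > 0")
    return (
        math.log(p)
        - d * math.log(scale)
        - math.lgamma(d / p)
        + (d - 1) * math.log(x)
        - (x / scale) ** p
    )


def tpgg_loglik(x, pi_zero, scale, d, p):
    """Log-likelihood of one expression value under a two-part model.

    With probability pi_zero the value is exactly zero (discrete part);
    otherwise it follows a generalized gamma distribution (continuous part).
    """
    if x == 0:
        return math.log(pi_zero)
    return math.log1p(-pi_zero) + gg_logpdf(x, scale, d, p)


def tpgg_mean(pi_zero, scale, d, p):
    """Mean of the two-part distribution.

    E[X] = (1 - pi_zero) * scale * Gamma((d + 1) / p) / Gamma(d / p)
    """
    log_ratio = math.lgamma((d + 1) / p) - math.lgamma(d / p)
    return (1 - pi_zero) * scale * math.exp(log_ratio)


# Hypothetical per-gene parameters; in an autoencoder setting these would be
# decoder outputs for each gene in each cell.
params = dict(pi_zero=0.3, scale=2.0, d=1.0, p=1.0)
values = [0.0, 0.5, 2.0]
negative_log_likelihood = -sum(tpgg_loglik(v, **params) for v in values)
```

With d = 1 and p = 1 the GG density reduces to an exponential distribution, and with p = 1 alone it reduces to the ordinary gamma distribution, which gives a quick sanity check on the formulas. Summing the negative log-likelihood over all genes and cells would yield a reconstruction loss of the kind an autoencoder could minimize.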