Shiga Mikio, Seno Shigeto, Onizuka Makoto, Matsuda Hideo
Graduate School of Information Science and Technology, Osaka University, Osaka, Japan.
PeerJ. 2021 Aug 27;9:e12087. doi: 10.7717/peerj.12087. eCollection 2021.
Single-cell RNA-sequencing is a rapidly evolving technology that enables us to understand biological processes at unprecedented resolution. Single-cell expression analysis requires a complex data processing pipeline with two main parts: the quantification part, which converts the sequence information into gene-cell matrix data, and the analysis part, which analyzes the matrix data using statistics and/or machine learning techniques. In the analysis part, unsupervised cell clustering plays an important role in identifying cell types and discovering cell diversity and subpopulations. Identified cell clusters are also used for subsequent analysis, such as finding differentially expressed genes and inferring cell trajectories. However, clustering results are greatly affected by the quantification method used in the upstream process: even if the original RNA-sequence data are the same, gene expression profiles processed by different quantification methods will produce different clusters. In this article, we propose a robust and highly accurate clustering method based on joint non-negative matrix factorization (joint-NMF) that utilizes the information from multiple gene expression profiles quantified from the same RNA-sequence data using different methods. Our joint-NMF extracts common factors among multiple gene expression profiles by applying an NMF to each profile under the constraint that one of the factorized matrices is shared among all the NMFs. By leveraging multiple quantification methods, the joint-NMF yields more robust and accurate cell clustering results than conventional clustering methods, which use only a single gene expression profile. Additionally, we show that the features extracted by our method are useful for discovering marker genes.
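The shared-factor constraint described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each quantification method yields a non-negative genes-by-cells matrix X_i over the same cells, that the shared factor is the cell-side matrix H (so X_i ≈ W_i H), and that the fit is a summed squared Frobenius loss minimized with standard multiplicative updates. The abstract does not state which factor is shared or how the optimization is done, so these choices, along with the names `joint_nmf`, `joint_loss` and `cluster_labels`, are hypothetical.

```python
import numpy as np


def joint_loss(matrices, Ws, H):
    """Summed squared Frobenius reconstruction error over all profiles."""
    return sum(np.linalg.norm(X - W @ H) ** 2 for X, W in zip(matrices, Ws))


def joint_nmf(matrices, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize each X_i (genes_i x cells) as W_i @ H with one shared H.

    Assumption: all matrices are non-negative and share the same cells
    (columns), in the same order. Gene sets may differ between matrices.
    """
    n_cells = matrices[0].shape[1]
    if any(X.shape[1] != n_cells for X in matrices):
        raise ValueError("all matrices must have the same number of cells")

    rng = np.random.default_rng(seed)
    Ws = [rng.random((X.shape[0], rank)) + eps for X in matrices]
    H = rng.random((rank, n_cells)) + eps

    for _ in range(n_iter):
        # Update each profile-specific factor W_i with H held fixed.
        HHt = H @ H.T
        for i, X in enumerate(matrices):
            Ws[i] = Ws[i] * (X @ H.T) / (Ws[i] @ HHt + eps)
        # Update the shared factor H using contributions from every profile.
        numer = sum(W.T @ X for W, X in zip(Ws, matrices))
        denom = sum(W.T @ W for W in Ws) @ H + eps
        H = H * numer / denom
    return Ws, H


def cluster_labels(H):
    """Assign each cell to the factor with the largest shared coefficient."""
    return H.argmax(axis=0)


if __name__ == "__main__":
    # Toy data: two hypothetical "quantifications" of the same 30 cells.
    rng = np.random.default_rng(1)
    H_true = rng.random((3, 30))
    X_a = rng.random((50, 3)) @ H_true
    X_b = rng.random((40, 3)) @ H_true
    Ws, H = joint_nmf([X_a, X_b], rank=3)
    print(cluster_labels(H))
```

Taking the arg-max over the rows of H is only one simple way to turn the shared factor into cluster labels; a separate clustering step on the columns of H would be an equally reasonable choice under the same assumptions.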