College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, P.R. China.
School of Mathematics and Statistics, Hainan Normal University, Haikou 570100, P.R. China.
Bioinformatics. 2020 May 1;36(10):3139-3147. doi: 10.1093/bioinformatics/btaa109.
Single-cell RNA-sequencing (scRNA-seq) technology provides a powerful tool for investigating cell heterogeneity and cell subpopulations by allowing the quantification of gene expression at the single-cell level. However, scRNA-seq data analysis remains challenging because of various kinds of technical noise, such as dropout events (i.e. excessive zero counts in the expression matrix).
Taking into account the associations among cells and genes, we propose a novel collaborative matrix factorization-based method called CMF-Impute to impute the dropout entries of a given scRNA-seq expression matrix. We test CMF-Impute and compare it with five other state-of-the-art methods on six popular real scRNA-seq datasets of various sizes and three simulated datasets. On the simulated datasets, CMF-Impute outperforms the other methods at imputing dropout values closest to the original expression values, as evaluated by both the sum of squared errors and the Pearson correlation coefficient. On the real datasets, CMF-Impute achieves the most accurate cell classification results regardless of the choice of clustering method, such as SC3 or t-SNE followed by k-means, as evaluated by both the adjusted Rand index and normalized mutual information. Finally, we demonstrate that CMF-Impute is powerful in reconstructing cell-to-cell and gene-to-gene correlations, and in inferring cell lineage trajectories.
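To make the evaluation measures named above concrete, the following is a minimal Python sketch of three of them: the sum of squared errors and the Pearson correlation coefficient over dropout entries, and the adjusted Rand index between two cluster labelings. It is not taken from the CMF-Impute package; the function names, the helper `dropout_entries`, the toy matrices, and the convention that a dropout is an entry that is nonzero in the reference matrix but zero in the observed one are all assumptions made for illustration.

```python
from collections import Counter
from math import comb, sqrt


def dropout_entries(truth, observed, imputed):
    """Collect (true, imputed) values at positions assumed to be dropouts:
    nonzero in the reference matrix but zero in the observed matrix."""
    t_vals, p_vals = [], []
    for t_row, o_row, p_row in zip(truth, observed, imputed):
        for t, o, p in zip(t_row, o_row, p_row):
            if o == 0 and t != 0:
                t_vals.append(t)
                p_vals.append(p)
    return t_vals, p_vals


def sse(truth_vals, imputed_vals):
    """Sum of squared errors between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(truth_vals, imputed_vals))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.
    Assumes neither sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the pair-counting contingency table."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy data (hypothetical): rows are genes, columns are cells.
truth = [[5.0, 0.0, 3.0], [2.0, 4.0, 0.0]]
observed = [[5.0, 0.0, 0.0], [0.0, 4.0, 0.0]]  # two entries dropped out
imputed = [[5.0, 0.0, 2.5], [1.0, 4.0, 0.0]]

t_vals, p_vals = dropout_entries(truth, observed, imputed)
print("SSE:", sse(t_vals, p_vals))
print("Pearson:", pearson(t_vals, p_vals))
print("ARI:", adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A lower sum of squared errors and a higher Pearson correlation indicate imputed values closer to the reference, while an adjusted Rand index near 1 indicates clusters that agree with the known cell labels up to relabeling.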
CMF-Impute is implemented as a MATLAB package, which is available at https://github.com/xujunlin123/CMFImpute.git.
Supplementary data are available at Bioinformatics online.