Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA.
Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA.
Nat Protoc. 2023 Dec;18(12):3690-3731. doi: 10.1038/s41596-023-00892-x. Epub 2023 Nov 21.
Non-negative matrix factorization (NMF) is an unsupervised learning method well suited to high-throughput biology. However, inferring biological processes from an NMF result still requires additional post hoc statistics and annotation to interpret the learned features. Here, we introduce a suite of computational tools that implement NMF and provide methods for accurate and clear biological interpretation and analysis. A generalized discussion of NMF covering its benefits, limitations and open questions is followed by four procedures for the Bayesian NMF algorithm Coordinated Gene Activity across Pattern Subsets (CoGAPS). Each procedure demonstrates NMF analysis to quantify cell state transitions in a public-domain single-cell RNA-sequencing dataset. The first demonstrates PyCoGAPS, our new Python implementation that improves runtime for large datasets, and the second allows its deployment in Docker. The third procedure steps through the same single-cell NMF analysis using our R CoGAPS interface. The fourth introduces a beginner-friendly CoGAPS platform using GenePattern Notebook, aimed at users with a working conceptual knowledge of data analysis but without basic proficiency in the R or Python programming languages. We also constructed a user-facing website to serve as a central repository for information and instructional materials about CoGAPS and its application programming interfaces. Setting up the packages and conducting a test run is expected to take around 15 min, with an additional 30 min to conduct analyses on a precomputed result. The expected runtime on the user's desired dataset can vary from hours to days depending on factors such as dataset size or input parameters.
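As a rough illustration of what an NMF decomposes, the sketch below factors a non-negative genes-by-cells matrix into two non-negative factors using the classic multiplicative update rules. This is not CoGAPS, which is a Bayesian algorithm, and it does not use the PyCoGAPS or R CoGAPS interfaces; the function names `nmf_multiplicative` and `demo`, the matrix names `A` and `P`, and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np


def nmf_multiplicative(X, n_patterns, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (genes x cells, non-negative) as A @ P.

    Illustrative only: uses multiplicative updates that minimize the
    Frobenius reconstruction error, not the Bayesian sampler in CoGAPS.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    # Random non-negative starting points for both factors.
    A = rng.random((n_genes, n_patterns)) + eps
    P = rng.random((n_patterns, n_cells)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so A and P
        # stay non-negative throughout.
        P *= (A.T @ X) / (A.T @ A @ P + eps)
        A *= (X @ P.T) / (A @ P @ P.T + eps)
    return A, P


def relative_error(X, A, P):
    """Frobenius-norm reconstruction error relative to the norm of X."""
    return np.linalg.norm(X - A @ P) / np.linalg.norm(X)


def demo():
    """Factor a synthetic rank-3 matrix (hypothetical data, not scRNA-seq)."""
    rng = np.random.default_rng(1)
    X = rng.random((50, 3)) @ rng.random((3, 40))
    A, P = nmf_multiplicative(X, n_patterns=3)
    return X, A, P
```

In this sketch the columns of `A` play the role of gene weights for each learned pattern and the rows of `P` the pattern activity across cells; interpreting such factors biologically is the step that still requires the post hoc statistics and annotation described above.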