Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea.
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA.
Nucleic Acids Res. 2022 Jul 8;50(12):e71. doi: 10.1093/nar/gkac216.
The standard analysis pipeline for single-cell RNA-seq data consists of sequential steps that begin with clustering the cells. An inherent limitation of this pipeline is that an imperfect clustering result can irreversibly affect the subsequent steps. For example, some cell types are not well distinguished by clustering because they largely share the same global structure, such as anterior primitive streak and mid primitive streak cells. If one searches for differentially expressed genes (DEGs) solely on the basis of clustering, the marker genes that distinguish these types will be missed. Moreover, clustering depends on many parameters and is often subject to manual decisions. To overcome these limitations, we propose MarcoPolo, a method that identifies informative DEGs independently of prior clustering. MarcoPolo sorts genes by evaluating whether their expression distributions are bimodal, whether similar expression patterns are observed in other genes, and whether the expressing cells lie close together in a low-dimensional space. Using real datasets with FACS-purified cell labels, we demonstrate that MarcoPolo recovers marker genes better than competing methods. Notably, MarcoPolo finds key genes that separate cell types that standard clustering cannot distinguish. MarcoPolo is provided as a convenient software package that reports its analysis results in an HTML file.
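The three criteria named in the abstract (bimodality, support from other genes, and proximity of expressing cells) can be illustrated with a small NumPy sketch. This is not MarcoPolo's implementation or API: the function names, the one-dimensional two-means split, the variance-explained bimodality proxy, the Jaccard threshold of 0.8, and the centroid-distance compactness measure are all assumptions chosen for illustration, and the statistics used by the published method may differ.

```python
import numpy as np


def split_high_low(expr, n_iter=50):
    """Split cells into low/high groups for one gene (1-D two-means on log1p counts)."""
    x = np.log1p(np.asarray(expr, dtype=float))
    lo, hi = x.min(), x.max()
    high = np.zeros(x.size, dtype=bool)
    if lo == hi:  # constant gene: nothing to split
        return x, high
    for _ in range(n_iter):
        high = np.abs(x - hi) < np.abs(x - lo)
        new_lo, new_hi = x[~high].mean(), x[high].mean()
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return x, high


def bimodality(x, high):
    """Heuristic proxy: fraction of variance explained by the low/high split."""
    total = ((x - x.mean()) ** 2).sum()
    if total == 0 or not high.any():
        return 0.0
    within = ((x[high] - x[high].mean()) ** 2).sum()
    within += ((x[~high] - x[~high].mean()) ** 2).sum()
    return 1.0 - within / total


def support(masks, min_jaccard=0.8):
    """Count, per gene, the other genes whose 'high' cells overlap (Jaccard >= threshold)."""
    m = np.asarray(masks, dtype=float)  # shape: genes x cells
    inter = m @ m.T
    sizes = m.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter
    jaccard = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    np.fill_diagonal(jaccard, 0.0)  # a gene does not support itself
    return (jaccard >= min_jaccard).sum(axis=1)


def compactness(embedding, high):
    """1 - (spread of expressing cells / spread of all cells) in the embedding."""
    if not high.any():
        return 0.0
    spread_all = np.linalg.norm(embedding - embedding.mean(axis=0), axis=1).mean()
    if spread_all == 0:
        return 0.0
    sub = embedding[high]
    spread_high = np.linalg.norm(sub - sub.mean(axis=0), axis=1).mean()
    return 1.0 - spread_high / spread_all


def score_genes(counts, embedding):
    """Score every gene (columns of a cells x genes count matrix) on the three criteria."""
    counts = np.asarray(counts)
    embedding = np.asarray(embedding, dtype=float)
    splits = [split_high_low(counts[:, g]) for g in range(counts.shape[1])]
    masks = np.array([high for _, high in splits])
    return {
        "bimodality": np.array([bimodality(x, high) for x, high in splits]),
        "support": support(masks),
        "compactness": np.array([compactness(embedding, high) for _, high in splits]),
    }
```

Here `counts` is assumed to be a cells-by-genes matrix of raw counts and `embedding` a cells-by-dimensions array such as the leading principal components. A gene that scores high on all three measures is a candidate marker even when no cluster label is available, which is the idea the abstract describes. How the three scores are combined into a single ordering is left open in this sketch.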