Department of Developmental Biology, University of Pittsburgh, Pittsburgh, PA, USA.
Department for Computational and Systems Biology, Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA.
Bioinformatics. 2018 Jul 1;34(13):i79-i88. doi: 10.1093/bioinformatics/bty260.
Genome-wide transcriptome sequencing applied to single cells (scRNA-seq) is rapidly becoming an assay of choice across many fields of biological and biomedical research. Scientific objectives often revolve around the discovery or characterization of cell types or sub-types, so obtaining accurate cell-cell similarities from scRNA-seq data is a critical step in many studies. While tools for scRNA-seq data analysis are advancing rapidly, few approaches explicitly address this task. Furthermore, the abundance and types of noise present in scRNA-seq datasets suggest that applying generic methods, or methods developed for bulk RNA-seq data, is likely suboptimal.
Here, we present RAFSIL, a random forest-based approach to learn cell-cell similarities from scRNA-seq data. RAFSIL implements a two-step procedure in which feature construction geared towards scRNA-seq data is followed by similarity learning. It is designed to be adaptable and expandable, and RAFSIL similarities can be used for typical exploratory data analysis tasks such as dimension reduction, visualization and clustering. We show that our approach compares favorably with current methods across a diverse collection of datasets, and that it can be used to detect and highlight unwanted technical variation in scRNA-seq datasets in situations where other methods fail. Overall, RAFSIL implements a flexible approach, yielding a useful tool that improves the analysis of scRNA-seq data.
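The general idea of a forest-based cell-cell similarity can be illustrated with a minimal Python sketch. This is not the RAFSIL implementation and does not use the RAFSIL R package: as a simplifying assumption it swaps in untrained, totally random trees for a fitted random forest, and it assumes the feature-construction step has already produced a small numeric feature table. The function names (`random_tree`, `leaf_of`, `forest_similarity`) and the toy data are hypothetical. Similarity between two cells is taken to be the fraction of trees in which both cells fall into the same leaf.

```python
import random


def random_tree(data, depth, rng):
    """Draw one random tree as a list of (feature index, threshold) splits.

    Assumption for brevity: the same split is applied to every node at a
    given depth, so a leaf is identified by the outcomes of all splits.
    """
    n_features = len(data[0])
    tree = []
    for _ in range(depth):
        j = rng.randrange(n_features)
        column = [row[j] for row in data]
        # Threshold drawn uniformly within the observed range of feature j.
        tree.append((j, rng.uniform(min(column), max(column))))
    return tree


def leaf_of(row, tree):
    """Return the leaf reached by one cell, encoded as a tuple of split outcomes."""
    return tuple(row[j] > threshold for j, threshold in tree)


def forest_similarity(data, n_trees=200, depth=3, seed=0):
    """Similarity of two cells = fraction of trees placing them in the same leaf."""
    rng = random.Random(seed)
    n = len(data)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        tree = random_tree(data, depth, rng)
        leaves = [leaf_of(row, tree) for row in data]
        for a in range(n):
            for b in range(n):
                if leaves[a] == leaves[b]:
                    counts[a][b] += 1
    return [[c / n_trees for c in row] for row in counts]


# Hypothetical toy input: 4 cells x 2 already-constructed features,
# forming two well-separated groups (cells 0-1 and cells 2-3).
cells = [
    [0.0, 0.1],
    [0.1, 0.0],
    [5.0, 5.1],
    [5.1, 5.0],
]

sim = forest_similarity(cells)
for row in sim:
    print(" ".join(f"{value:.2f}" for value in row))
```

In this sketch, cells from the same group should receive a much higher similarity than cells from different groups, and the resulting matrix (or one minus it, as a dissimilarity) could be passed to standard clustering or embedding routines.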
The RAFSIL R package is available at www.kostkalab.net/software.html.
Supplementary data are available at Bioinformatics online.