Shen Yifei, Chu Qinjie, Timko Michael P, Fan Longjiang
Department of Laboratory Medicine, First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou 310003, China.
Key Laboratory of Clinical In Vitro Diagnostic Techniques of Zhejiang Province, Hangzhou 310003, China.
Bioinformatics. 2021 Nov 18;37(22):4115-4122. doi: 10.1093/bioinformatics/btab410.
Single-cell RNA sequencing (scRNA-seq) has enabled the characterization of different cell types in many tissues and tumor samples. Cell type identification is an essential step in single-cell RNA profiling, a technology that is currently transforming the life sciences. It is often achieved by searching for combinations of genes previously implicated as cell-type specific, an approach that is not quantitative and does not explicitly take advantage of other scRNA-seq studies. In addition, batch effects and differences between data platforms greatly decrease predictive performance when a classifier is validated across laboratories or data types.
Here, we present a new ensemble learning method named 'scDetect' that combines gene expression rank-based analysis with a majority-vote ensemble machine-learning probability-based prediction method, and that is capable of highly accurate classification of cells based on scRNA-seq data from different sequencing platforms. Because of tumor heterogeneity, we have also incorporated cell copy number variation consensus clustering and an epithelial score into the classification in order to accurately predict tumor cells in scRNA-seq data. We applied scDetect to scRNA-seq data from pancreatic tissue, mononuclear cells and tumor biopsy cells and show that scDetect classified individual cells with high accuracy and performed better than other publicly available tools.
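The idea of combining rank-based expression features with a majority vote can be illustrated with a minimal Python sketch. This is not the scDetect implementation: the gene panel, the pairwise rules in `VOTERS`, and the helper names `rank_transform` and `predict` are all hypothetical and chosen only for illustration. The sketch shows why ranks help across platforms: any monotone rescaling of a cell's expression values leaves the within-cell ranks, and therefore the votes, unchanged.

```python
from collections import Counter

# Hypothetical gene panel and pairwise rules, for illustration only.
# Each rule is (gene_a, gene_b, label_if_a_ranks_higher, label_otherwise).
GENES = ["INS", "GCG", "KRT19", "PTPRC"]
VOTERS = [
    ("INS", "GCG", "beta", "alpha"),
    ("INS", "KRT19", "beta", "ductal"),
    ("GCG", "KRT19", "alpha", "ductal"),
]


def rank_transform(cell):
    """Replace each expression value by its within-cell rank (0 = lowest).

    Ties are broken by gene order, which is enough for this sketch.
    """
    order = sorted(range(len(cell)), key=lambda i: cell[i])
    ranks = [0] * len(cell)
    for rank, gene_idx in enumerate(order):
        ranks[gene_idx] = rank
    return ranks


def predict(cell):
    """Return (label, vote_fraction) from a majority vote over the rules."""
    ranks = dict(zip(GENES, rank_transform(cell)))
    votes = Counter(
        a_label if ranks[a] > ranks[b] else b_label
        for a, b, a_label, b_label in VOTERS
    )
    label, count = votes.most_common(1)[0]
    return label, count / len(VOTERS)


# A toy cell with high INS expression (values ordered as in GENES).
cell = [50, 3, 1, 0]
# The same cell under a monotone rescaling, standing in for a platform effect.
rescaled = [10 * x + 5 for x in cell]

result = predict(cell)
result_rescaled = predict(rescaled)
print(result, result_rescaled)
```

Both calls return the same label and vote fraction, because the rescaling preserves the order of the genes within the cell. The vote fraction is only a crude stand-in for the probability-based prediction described above.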
scDetect is open-source software. Source code and test data are freely available from GitHub (https://github.com/IVDgenomicslab/scDetect/) and Zenodo (https://zenodo.org/record/4764132#.YKCOlrH5AYN). Examples and a tutorial are available at https://ivdgenomicslab.github.io/scDetect-Introduction/. scDetect will also be made available through Bioconductor.
Supplementary data are available at Bioinformatics online.