Cell Type Annotation Model Selection: General-Purpose vs. Pattern-Aware Feature Gene Selection in Single-Cell RNA-Seq Data.

Affiliation

School of Computer Science, University of Windsor, Windsor, ON N9B 3P4, Canada.

Publication information

Genes (Basel). 2023 Feb 26;14(3):596. doi: 10.3390/genes14030596.

DOI:10.3390/genes14030596

PMID:36980868

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10048047/

Abstract

Advances in high-throughput sequencing technology have driven a growing body of research on heterogeneity among cells. Functional differences between individual cells are inferred from differences in their gene expression profiles. Although clustering methods have been reported to perform well, manually annotating the resulting cell clusters remains a bottleneck that has yet to be addressed in a faster and more scalable way. Supervised techniques, on the other hand, have seen limited use in cell type identification because few labelled datasets are available, yet where applied they have produced more robust results than clustering methods. A recent study showed that adding a feature selection step helped a support vector machine (SVM) outperform other classifiers in several scenarios. In this article, we compare and evaluate two state-of-the-art supervised methods, XGBoost and SVM, using information gain as the feature selection method. Experiments on three standard scRNA-seq datasets indicate that XGBoost annotates cell types automatically within a simpler and more scalable framework. The results also shed light on the potential of combining boosted tree approaches with deep neural networks to capture the underlying information in single-cell RNA-Seq data more effectively. The approach can also be used to identify marker genes and in other applications in biological studies.
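To make the feature selection step concrete, the following is a minimal sketch of ranking genes by information gain with respect to cell type labels. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how expression values are discretized, so the sketch simply binarizes each gene as expressed or not expressed above a hypothetical `threshold`, and the gene names, labels, and counts are invented toy values.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(expressed, labels):
    """Information gain of a binary feature with respect to the labels.

    IG = H(labels) - sum over feature values v of P(v) * H(labels | v).
    """
    total = len(labels)
    on = [y for x, y in zip(expressed, labels) if x]
    off = [y for x, y in zip(expressed, labels) if not x]
    conditional = (len(on) / total) * entropy(on) \
        + (len(off) / total) * entropy(off)
    return entropy(labels) - conditional


def rank_genes(counts, labels, threshold=0.0):
    """Rank genes by information gain, highest first.

    counts maps a gene name to its per-cell expression values.
    Binarizing at `threshold` is an assumption made for this sketch.
    """
    scores = {
        gene: information_gain([v > threshold for v in values], labels)
        for gene, values in counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: four cells, two hypothetical cell types, three hypothetical genes.
labels = ["T", "T", "B", "B"]
counts = {
    "GENE_A": [5, 3, 0, 0],  # expressed only in type T cells
    "GENE_B": [2, 0, 4, 0],  # expressed in one cell of each type
    "GENE_C": [1, 1, 1, 0],  # expressed in both T cells and one B cell
}
ranking = rank_genes(counts, labels)
for gene, score in ranking:
    print(f"{gene}: {score:.3f} bits")
```

In a workflow like the one the abstract describes, the top-ranked genes would then be passed as input features to a classifier such as SVM or XGBoost; the number of genes to keep is a tuning choice that the abstract does not specify.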
