Brain Research Center, Zhongnan Hospital, Second Clinical School, Wuhan University, Wuhan, China.
Graduate School of Biostudies, Kyoto University, Kyoto, Japan.
Commun Biol. 2024 Sep 13;7(1):1128. doi: 10.1038/s42003-024-06790-6.
Revealing the heterogeneity among tissues is the greatest advantage of single-cell sequencing. Marker genes are not only the key to correctly identifying cell types but also serve as biomarkers of cell status under certain experimental conditions. Current analysis tools such as Seurat and Monocle employ algorithms that compare one cluster against all the rest and select markers according to statistical tests. This pattern introduces redundant calculations and thus results in low computational efficiency, specificity and accuracy. To address these issues, we introduce starTracer, a novel algorithm designed to enhance the efficiency, specificity and accuracy of marker gene identification in single-cell RNA-seq data analysis. starTracer operates as an independent pipeline and is flexible in accepting multiple input file types. Its primary output is a marker matrix in which genes are sorted by their potential to function as markers, with those exhibiting the greatest potential positioned at the top. Compared with Seurat, starTracer is 2 to 3 orders of magnitude faster across three independent datasets, and it shows a lower false-positive rate in a simulated test dataset with ground truth. Notably, the speed improvement grows with larger data volumes. starTracer also excels at identifying markers in smaller clusters. These advantages establish starTracer as an important tool for single-cell RNA-seq data analysis, combining robust accuracy with exceptional speed.
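The idea of a marker matrix, with genes ranked by their potential to act as markers, can be illustrated with a small sketch. This is not starTracer's actual algorithm, which the abstract does not specify. It assumes a simple stand-in score: each gene is assigned to the cluster where its mean expression is highest, and ranked by the share of its summed cluster means that falls in that cluster. The function name `rank_markers`, the parameter `top_n`, and the toy data are all hypothetical. The sketch computes every cluster mean in a single pass rather than repeating a one-versus-rest comparison for each cluster.

```python
import numpy as np

def rank_markers(expr, labels, top_n=2):
    """Rank candidate marker genes per cluster (illustrative score only).

    expr   : (n_cells, n_genes) array of non-negative expression values
    labels : (n_cells,) array of cluster ids
    Returns a dict mapping cluster id -> list of gene indices, best first.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)

    # Mean expression of every gene in every cluster, computed once.
    means = np.vstack([expr[labels == c].mean(axis=0) for c in clusters])

    # Assumed specificity score: fraction of a gene's summed cluster means
    # that is contributed by its highest-expressing cluster.
    totals = means.sum(axis=0)
    safe_totals = np.where(totals > 0, totals, 1.0)  # avoid division by zero
    specificity = means.max(axis=0) / safe_totals
    best_cluster = means.argmax(axis=0)

    markers = {}
    for i, c in enumerate(clusters):
        # Genes whose peak expression is in this cluster, skipping silent genes.
        genes = np.flatnonzero((best_cluster == i) & (totals > 0))
        order = genes[np.argsort(-specificity[genes], kind="stable")]
        markers[c.item()] = order[:top_n].tolist()
    return markers

# Toy data: 4 cells, 3 genes, 2 clusters (made up for illustration).
expr = [[5, 0, 1],
        [4, 0, 1],
        [0, 3, 1],
        [0, 6, 1]]
labels = [0, 0, 1, 1]
result = rank_markers(expr, labels)
print(result)
```

In the toy data, gene 0 is expressed only in cluster 0 and gene 1 only in cluster 1, so each ranks first for its cluster, while the uniformly expressed gene 2 receives a low score. A real pipeline would work on normalized counts and add statistical filtering, which this sketch omits.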