Chen Jinjin, Mohamed Ahmed, Bhuva Dharmesh D, Davis Melissa J, Tan Chin Wee
Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC 3052, Australia.
Department of Medical Biology, Faculty of Medicine, Dentistry and Health Sciences, University of Melbourne, Parkville, VIC 3010, Australia.
Bioinformatics. 2025 Mar 4;41(3). doi: 10.1093/bioinformatics/btaf114.
Biomarker discovery offers insight into the mechanisms that may underlie disease. While existing biomarker identification methods primarily focus on single-cell RNA sequencing (scRNA-seq) data, there remains a need for automated methods designed for labeled bulk RNA-seq data from sorted cell populations or experiments. Current methods require curation of results or statistical thresholds and may not account for tissue background expression. Here we address these limitations with an automated marker identification method for labeled bulk RNA-seq data that explicitly considers background expression.
We developed mastR, a novel tool for accurate marker identification using transcriptomic data. It leverages robust statistical pipelines such as edgeR and limma to perform pairwise comparisons between groups, and aggregates the results using a rank-product-based permutation test. A signal-to-noise ratio approach is implemented to minimize background signals. We assessed the performance of mastR-derived NK cell signatures against published curated signatures and found that the mastR-derived signature performs as well as, if not better than, the published signatures. We further demonstrated the utility of mastR on simulated scRNA-seq data and compared its marker selection performance with that of Seurat.
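To illustrate the aggregation step described above, the following is a minimal sketch of a rank-product permutation test in plain Python. It is not mastR's implementation or API (mastR is an R/Bioconductor package). The function names, the toy scores, the tie handling, and the permutation scheme (shuffling scores within each comparison and pooling the null rank products across genes) are all assumptions chosen for illustration.

```python
import math
import random
from bisect import bisect_right


def ranks_descending(scores):
    """Rank genes within one comparison; rank 1 is the largest score.
    Ties are broken by input order (a simplification)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, gene in enumerate(order, start=1):
        ranks[gene] = rank
    return ranks


def rank_product(score_table):
    """score_table[c][g] is the score of gene g in pairwise comparison c.
    Returns the geometric mean of each gene's ranks across comparisons."""
    rank_table = [ranks_descending(scores) for scores in score_table]
    n_comp = len(rank_table)
    n_genes = len(rank_table[0])
    return [
        math.exp(sum(math.log(rank_table[c][g]) for c in range(n_comp)) / n_comp)
        for g in range(n_genes)
    ]


def rank_product_pvalues(score_table, n_perm=1000, seed=0):
    """Permutation p-value per gene: the fraction of null rank products
    (scores shuffled independently within each comparison) that are as
    small as or smaller than the observed rank product."""
    rng = random.Random(seed)
    observed = rank_product(score_table)
    n_genes = len(observed)
    as_extreme = [0] * n_genes
    for _ in range(n_perm):
        shuffled = [rng.sample(scores, len(scores)) for scores in score_table]
        null = sorted(rank_product(shuffled))
        for g, rp in enumerate(observed):
            as_extreme[g] += bisect_right(null, rp)
    total = n_perm * n_genes
    # Add-one correction keeps p-values strictly above zero.
    return [(count + 1) / (total + 1) for count in as_extreme]


# Hypothetical scores (e.g. moderated t-statistics) for 6 genes in
# 3 pairwise comparisons; gene 0 ranks first in every comparison.
toy_scores = [
    [5.1, 2.0, 0.3, -1.2, 1.1, -0.4],
    [4.2, 0.5, 1.9, -0.7, 0.2, -2.0],
    [6.0, 1.5, -0.1, 0.4, 2.2, -1.0],
]
toy_rp = rank_product(toy_scores)
toy_pvalues = rank_product_pvalues(toy_scores, n_perm=200, seed=1)
```

In this toy input, a gene that is consistently top-ranked across all pairwise comparisons receives the smallest rank product and therefore the smallest permutation p-value, which is the intuition behind using rank products to aggregate comparisons without a hand-picked threshold per comparison.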
mastR is freely available from https://bioconductor.org/packages/release/bioc/html/mastR.html. A vignette and guide are available at https://davislaboratory.github.io/mastR. All statistical analyses were carried out using R (version ≥4.3.0) and Bioconductor (version ≥3.17).