Lodi Musaddiq K, Clark Leiliani, Roy Satyaki, Ghosh Preetam
Integrative Life Sciences, Virginia Commonwealth University, Richmond, VA, United States of America.
Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, United States of America.
bioRxiv. 2024 Dec 23:2024.12.23.630040. doi: 10.1101/2024.12.23.630040.
The advent of single-cell RNA sequencing (scRNA-seq) has greatly enhanced our ability to explore cellular heterogeneity at high resolution. Identifying subpopulations of cells and their associated molecular markers is crucial to understanding their distinct roles in tissues. To address the challenges in marker gene selection, we introduce CORTADO, a computational framework based on hill-climbing optimization for the efficient discovery of cell-type-specific markers. CORTADO optimizes three critical properties: differential expression in the clusters of interest, distinctiveness in gene expression profiles to minimize redundancy, and sparseness to ensure a concise and biologically meaningful marker set. Unlike traditional methods that rely on ranking genes by p-values, CORTADO incorporates both differential expression metrics and penalties for overlapping expression profiles, so that each selected marker uniquely represents its cluster while maintaining biological relevance. CORTADO supports both constrained selection, in which users specify the number of markers to identify, and unconstrained selection, making it adaptable to diverse analytical needs and scalable to datasets of varying complexity. To validate its performance, we apply CORTADO to several datasets, including the DLPFC 151507 dataset, the Zeisel mouse brain dataset, and a peripheral blood mononuclear cell dataset. Through enrichment analysis and examination of spatial localization-based expression, we demonstrate the robustness of CORTADO in identifying biologically relevant and non-redundant markers in complex datasets. CORTADO provides an efficient and scalable solution for cell-type marker discovery, offering improved sensitivity and specificity compared to existing methods.
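The three properties named in the abstract (differential expression, distinctiveness, sparseness) can be illustrated with a minimal hill-climbing sketch. This is not the CORTADO implementation: the objective form, the weights `alpha` and `beta`, the names `objective` and `hill_climb`, and the use of a precomputed gene-gene similarity matrix are all assumptions made for illustration. The sketch flips one gene in or out of the marker set at a time and keeps a flip only if the score improves.

```python
import numpy as np

def objective(mask, de_score, sim, alpha=1.0, beta=0.1):
    """Score a boolean gene-selection mask (higher is better).

    Hypothetical objective, assumed for illustration only:
      + sum of differential-expression scores of the selected genes
      - alpha * summed pairwise similarity among selected genes (redundancy)
      - beta  * number of selected genes (sparseness)
    """
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return 0.0
    de_term = de_score[idx].sum()
    sub = sim[np.ix_(idx, idx)]
    # Count each unordered pair once and ignore self-similarity on the diagonal.
    redundancy = (sub.sum() - np.trace(sub)) / 2.0
    return float(de_term - alpha * redundancy - beta * idx.size)

def hill_climb(de_score, sim, alpha=1.0, beta=0.1, max_iter=100, seed=0):
    """Greedy single-flip hill climbing starting from the empty marker set."""
    rng = np.random.default_rng(seed)
    n_genes = de_score.size
    mask = np.zeros(n_genes, dtype=bool)
    best = objective(mask, de_score, sim, alpha, beta)
    for _ in range(max_iter):
        improved = False
        for g in rng.permutation(n_genes):
            mask[g] = not mask[g]          # propose flipping gene g
            cand = objective(mask, de_score, sim, alpha, beta)
            if cand > best + 1e-12:
                best = cand                # keep the flip
                improved = True
            else:
                mask[g] = not mask[g]      # revert the flip
        if not improved:                   # local optimum reached
            break
    return mask, best

# Toy data: genes 0 and 1 are both strongly differential but nearly identical.
de_score = np.array([3.0, 2.9, 0.1, 2.0])
sim = np.eye(4)
sim[0, 1] = sim[1, 0] = 0.95
mask, best = hill_climb(de_score, sim, alpha=5.0, beta=0.5)
print(np.flatnonzero(mask), best)
```

On this toy input the redundancy penalty keeps only one of the two near-identical genes, and the sparseness penalty excludes the weakly differential gene. Hill climbing finds a local optimum, so which of the two redundant genes survives depends on the visiting order. A constrained variant with a fixed marker count could, under the same assumptions, replace single flips with swaps between a selected and an unselected gene.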