Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA 02118, USA.
Bioinformatics Program, College of Engineering, Boston University, Boston, MA 02118, USA.
Nucleic Acids Res. 2021 Sep 27;49(17):e98. doi: 10.1093/nar/gkab552.
As high-throughput genomics assays become more efficient and cost-effective, their use has become standard in large-scale biomedical projects. These studies are often exploratory, in that relationships between samples are not explicitly defined a priori, but rather emerge from data-driven discovery and annotation of molecular subtypes, thereby informing hypotheses and independent evaluation. Here, we present K2Taxonomer, a novel unsupervised recursive partitioning algorithm and associated R package that utilize ensemble learning to identify robust subgroups in a 'taxonomy-like' structure. K2Taxonomer was devised to accommodate different data paradigms, and is suitable for the analysis of both bulk and single-cell transcriptomics data, as well as other '-omics' data. For each of these data types, we demonstrate the power of K2Taxonomer to discover known relationships in both simulated and human tissue data. We conclude with a practical application to breast cancer tumor-infiltrating lymphocyte (TIL) single-cell profiles, in which we identified co-expression of translational machinery genes as a dominant transcriptional program shared by T cell subtypes and associated with better prognosis in breast cancer tissue bulk expression data.
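To make the idea of ensemble-based recursive partitioning concrete, the following is a minimal Python sketch of one way such a procedure could look. It is not K2Taxonomer's implementation or API (the package itself is written in R, and the abstract does not specify its internals). The function names `two_means`, `consensus_split`, and `build_taxonomy`, the 2-means splitter, the feature bootstrap, and the `min_size`, `min_stability`, and `n_boot` parameters are all assumptions made for illustration.

```python
import random


def two_means(points, iters=20):
    """Split points into two groups with a small, deterministic 2-means."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Seed the centers with the first point and the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: dist2(p, c0))
    centers = [list(c0), list(c1)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def consensus_split(data, ids, n_boot, rng):
    """Aggregate many 2-way splits, each fit on a bootstrap sample of features."""
    n = len(ids)
    n_feat = len(data[ids[0]])
    together = [[0] * n for _ in range(n)]  # co-clustering counts
    for _ in range(n_boot):
        feats = [rng.randrange(n_feat) for _ in range(n_feat)]
        pts = [[data[i][f] for f in feats] for i in ids]
        assign = two_means(pts)
        for a in range(n):
            for b in range(n):
                if assign[a] == assign[b]:
                    together[a][b] += 1
    # Use the pair of samples that co-clustered least often as the two seeds.
    a0, b0 = min(((a, b) for a in range(n) for b in range(a + 1, n)),
                 key=lambda ab: together[ab[0]][ab[1]])
    left = [ids[k] for k in range(n) if together[k][a0] >= together[k][b0]]
    right = [ids[k] for k in range(n) if together[k][a0] < together[k][b0]]
    # Stability: average fraction of runs in which a sample sided with its seed.
    stability = sum(max(together[k][a0], together[k][b0])
                    for k in range(n)) / (n * n_boot)
    return left, right, stability


def build_taxonomy(data, ids=None, min_size=3, min_stability=0.8,
                   n_boot=25, rng=None):
    """Return a nested tuple of splits; each leaf is a sorted list of sample indices."""
    if rng is None:
        rng = random.Random(0)
    if ids is None:
        ids = list(range(len(data)))
    if len(ids) < 2 * min_size:
        return sorted(ids)
    left, right, stability = consensus_split(data, ids, n_boot, rng)
    if min(len(left), len(right)) < min_size or stability < min_stability:
        return sorted(ids)  # split is too small or too unstable: stop here
    return (build_taxonomy(data, left, min_size, min_stability, n_boot, rng),
            build_taxonomy(data, right, min_size, min_stability, n_boot, rng))
```

Calling `build_taxonomy(data)` on a list of equal-length numeric profiles returns a nested tuple whose leaves are lists of sample indices, so the nesting itself plays the role of the 'taxonomy-like' structure. In this sketch, a node becomes a leaf when either side of the proposed split is smaller than `min_size` or the consensus across bootstrap runs falls below `min_stability`.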