Clemson University, Department of Genetics & Biochemistry, Clemson, SC 29634, USA.
Quantum Insights Inc., Menlo Park, CA 94025, USA.
Sci Rep. 2018 May 25;8(1):8180. doi: 10.1038/s41598-018-26310-x.
We applied two state-of-the-art, knowledge-independent data-mining methods, Dynamic Quantum Clustering (DQC) and t-Distributed Stochastic Neighbor Embedding (t-SNE), to data from The Cancer Genome Atlas (TCGA). We showed that the RNA expression patterns of a mixture of 2,016 samples from five tumor types can sort the tumors into groups enriched for relevant annotations, including tumor type, gender, tumor stage, and ethnicity. DQC feature selection analysis discovered 48 core biomarker transcripts that clustered tumors by tumor type. When these transcripts were removed, the geometry of tumor relationships changed, but it was still possible to classify the tumors using the RNA expression profiles of the remaining transcripts. We repeated the removal of the top biomarkers for several iterations, performing cluster analysis after each one. Even though the most informative transcripts had been removed, the sorting ability of the remaining transcripts stayed strong after each iteration. Further, in some iterations we detected a repeating pattern of biological function that was not detectable while the core biomarker transcripts were present. This suggests the existence of a "background classification" potential, in which the pattern of gene expression that remains after continued removal of "biomarker" transcripts can still classify tumors in agreement with tumor type.
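The iterative removal procedure described above can be illustrated with a small toy sketch. It is not the authors' pipeline: DQC and t-SNE are not used here, the data are synthetic rather than TCGA expression values, and a supervised between/within variance score with a nearest-centroid classifier stands in for DQC feature selection and cluster analysis. All function names and parameters (`make_data`, `feature_scores`, `iterative_removal`, `n_remove`, and so on) are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean, pvariance


def make_data(n_classes=3, n_per_class=30, n_features=60, n_strong=5, seed=0):
    """Synthetic 'expression' matrix (toy assumption, not TCGA data).

    A few features separate the classes strongly; all the others carry
    a weak class signal, mimicking a diffuse background.
    """
    rng = random.Random(seed)
    centers = [
        [rng.gauss(0, 3.0 if j < n_strong else 0.6) for j in range(n_features)]
        for _ in range(n_classes)
    ]
    X, y = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            X.append([rng.gauss(centers[c][j], 1.0) for j in range(n_features)])
            y.append(c)
    return X, y


def feature_scores(X, y, features):
    """Score each feature by between-class over within-class variance.

    This supervised score is a stand-in for DQC feature selection.
    """
    classes = sorted(set(y))
    scores = {}
    for j in features:
        groups = [[row[j] for row, lab in zip(X, y) if lab == c] for c in classes]
        between = pvariance([mean(g) for g in groups])
        within = mean(pvariance(g) for g in groups)
        scores[j] = between / within
    return scores


def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te, features):
    """Classify held-out samples by the nearest class centroid."""
    classes = sorted(set(y_tr))
    centroids = {
        c: [mean(row[j] for row, lab in zip(X_tr, y_tr) if lab == c) for j in features]
        for c in classes
    }
    correct = 0
    for row, lab in zip(X_te, y_te):
        point = [row[j] for j in features]
        pred = min(
            classes,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += pred == lab
    return correct / len(y_te)


def iterative_removal(X, y, n_rounds=4, n_remove=5):
    """Evaluate, then drop the top-ranked features, and repeat."""
    X_tr, y_tr = X[0::2], y[0::2]  # even-indexed samples for training
    X_te, y_te = X[1::2], y[1::2]  # odd-indexed samples held out
    features = list(range(len(X[0])))
    results = []
    for _ in range(n_rounds):
        acc = nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te, features)
        results.append((len(features), acc))
        scores = feature_scores(X_tr, y_tr, features)
        top = set(sorted(features, key=scores.get, reverse=True)[:n_remove])
        features = [j for j in features if j not in top]
    return results


X, y = make_data()
results = iterative_removal(X, y)
for n_feat, acc in results:
    print(f"{n_feat} features remaining: held-out accuracy {acc:.2f}")
```

In this toy setting, the accuracy after each round indicates how much class information survives once the most discriminative features have been removed. Whether it stays high depends entirely on the assumed strength of the weak background signal, so the sketch illustrates the procedure only and is not evidence for the paper's finding.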