NUS Graduate School for Integrative Sciences & Engineering, National University of Singapore, Singapore, Singapore.
Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore.
Sci Data. 2019 Oct 8;6(1):194. doi: 10.1038/s41597-019-0207-2.
A vast number of transcriptome profiles exist as microarray data. However, they were generated on diverse platforms and processed with different preprocessing tools, so cross-dataset analyses require considerable time and informatics expertise. A single, integrated data source would facilitate data reuse for the discovery, analysis, and validation of biomarker-based clinical strategies. Here, we present merged microarray-acquired datasets (MMDs) across 11 major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Using machine learning algorithms, we show that diagnostic models trained on MMDs can be applied directly to RNA-seq-acquired TCGA data with high classification accuracy. Machine-learning-optimized MMDs further help reveal the immune landscape across various carcinomas, which is critically needed for disease management and clinical interventions. This unified data source may serve as an excellent training or test set for applying, developing, and refining machine learning algorithms that can be used to better define the genomic landscape of human cancers.
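The abstract's central claim, that a model trained on microarray data can classify RNA-seq samples, depends on putting both platforms on a comparable scale. The sketch below illustrates one generic way to do this and is not the authors' pipeline: it assumes per-sample rank normalization followed by a nearest-centroid classifier, and every function name, gene value, and label in it is hypothetical toy data chosen for illustration.

```python
from statistics import mean


def rank_normalize(sample):
    """Convert one sample's expression values to ranks scaled to [0, 1].

    Tied values share their average rank. Ranks depend only on the ordering
    of genes within a sample, so they are insensitive to platform scale.
    """
    n = len(sample)
    order = sorted(range(n), key=lambda i: sample[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and sample[order[j + 1]] == sample[order[i]]:
            j += 1
        avg = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    denom = max(n - 1, 1)
    return [r / denom for r in ranks]


def fit_centroids(samples, labels):
    """Average the rank-normalized profiles of each class (assumed classifier)."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(rank_normalize(sample))
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(centroids, sample):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    x = rank_normalize(sample)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical microarray-like training data: 4 genes, log-scale intensities.
train = [
    [8.1, 3.2, 5.0, 6.5],
    [7.9, 2.8, 5.5, 6.0],
    [3.0, 8.5, 6.1, 5.2],
    [2.5, 9.0, 6.4, 5.0],
]
labels = ["tumor", "tumor", "normal", "normal"]
centroids = fit_centroids(train, labels)

# Hypothetical RNA-seq-like test samples: same genes, count-scale values.
rnaseq_a = [1500, 20, 300, 700]
rnaseq_b = [10, 900, 400, 200]
print(predict(centroids, rnaseq_a), predict(centroids, rnaseq_b))
```

Rank normalization is used here only because it needs no platform-specific parameters; real cross-platform work would also have to map probes to a shared gene set and handle batch effects, which this sketch omits.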