Machine learning with the TCGA-HNSC dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality.

Author information

Department of Electrical and Computer Engineering, Center for Bioinformatics and Computational Biology, University of Iowa, 5017 Seamans Center, Iowa City, IA, 52242, USA.

Department of Radiation Oncology, University of Iowa Carver College of Medicine, LL-W Pomerantz Family Pavilion, 200 Hawkins Drive, Iowa City, IA, 52242-1089, USA.

Publication information

BMC Bioinformatics. 2019 Jun 17;20(1):339. doi: 10.1186/s12859-019-2929-8.

Abstract

BACKGROUND

In the era of precision oncology and publicly available datasets, the amount of information available for each patient case has increased dramatically. From clinical variables and PET-CT radiomics measures to DNA-variant and RNA expression profiles, this variety of data presents a multitude of challenges. Large clinical datasets often contain sparsely or inconsistently populated fields. The corresponding sequencing profiles can suffer from high dimensionality, where useful inferences are difficult to make without correspondingly large numbers of instances. In this paper we report a novel deployment of machine learning techniques to handle data sparsity and high dimensionality, while evaluating potential biomarkers in the form of unsupervised transformations of RNA data. We apply preprocessing, multiple imputation by chained equations (MICE), and sparse principal component analysis (SPCA) to improve the usability of more than 500 patient cases from the TCGA-HNSC dataset, with the aim of enhancing future oncological decision support for Head and Neck Squamous Cell Carcinoma (HNSCC).

RESULTS

Imputation was shown to improve the prognostic ability of sparse clinical treatment variables. SPCA transformation of RNA expression variables reduced runtime for RNA-based models, though changes to classifier performance were not significant. Gene ontology enrichment analysis of the gene sets associated with individual sparse principal components (SPCs) is also reported. It shows that both high- and low-importance SPCs were associated with cell death pathways, though the high-importance gene sets were associated with a wider variety of cancer-related biological processes.
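To illustrate how a sparse principal component yields both a low-dimensional feature and a gene set, the following is a minimal sketch that extracts one sparse component by soft-thresholded power iteration. This is a simple stand-in for SPCA, not the authors' implementation. The function names, the penalty `lam`, and the synthetic expression matrix are all assumptions made for illustration; no TCGA-HNSC data is used.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry toward zero by lam; entries with |v| <= lam become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pc(X, lam, n_iter=200):
    """One sparse principal component via soft-thresholded power iteration.

    X   : (samples x genes) expression matrix
    lam : sparsity penalty (hypothetical tuning value)
    Returns (loadings, scores); loadings has unit norm and many exact zeros.
    """
    Xc = X - X.mean(axis=0)                      # center each gene
    # Start from the ordinary (dense) first principal direction.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    v = vt[0]
    for _ in range(n_iter):
        u = Xc @ v
        u /= np.linalg.norm(u)                   # unit-norm sample scores
        v_new = soft_threshold(Xc.T @ u, lam)    # sparsify gene loadings
        norm = np.linalg.norm(v_new)
        if norm == 0:
            raise ValueError("lam too large: all loadings shrunk to zero")
        v_new /= norm
        converged = np.allclose(v_new, v, atol=1e-10)
        v = v_new
        if converged:
            break
    return v, Xc @ v

# Synthetic example: 60 samples, 50 genes, one latent factor driving genes 0-4.
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 50
latent = rng.normal(size=n_samples)
X = rng.normal(scale=0.3, size=(n_samples, n_genes))
X[:, :5] += 3.0 * latent[:, None]

loadings, scores = sparse_pc(X, lam=3.0)
gene_set = np.flatnonzero(loadings)              # genes with nonzero loading
print("genes in sparse component:", gene_set)
```

The nonzero entries of `loadings` define the gene set that could be passed to an enrichment tool, while `scores` is the single derived feature that replaces the original gene columns in a downstream classifier.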

CONCLUSIONS

MICE imputation allowed us to impute missing values for clinically informative features, improving their overall importance for predicting two-year recurrence-free survival by incorporating variance from other clinical variables. Dimensionality reduction of RNA expression profiles via SPCA reduced both computation cost and model training/evaluation time without affecting classifier performance, allowing researchers to obtain experimental results much more quickly. SPCA simultaneously provided a convenient avenue for consideration of biological context via gene ontology enrichment analysis.
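The idea of imputing a sparse variable from other clinical variables can be sketched as a chained regression loop. The code below is a deterministic, single-imputation simplification: full MICE adds random draws and produces several completed datasets, and this is not the authors' pipeline. The function name `chained_impute` and the three synthetic variables are assumptions made for illustration.

```python
import numpy as np

def chained_impute(X, n_rounds=10):
    """Fill NaNs by cycling through columns and regressing each on the others.

    Simplified relative to MICE: linear models only, no random draws,
    and a single completed dataset is returned.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: replace each missing entry with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])  # intercept + predictors
            # Fit on rows where column j was actually observed.
            coef, *_ = np.linalg.lstsq(A[~miss_j], filled[~miss_j, j], rcond=None)
            # Overwrite only the originally missing entries.
            filled[miss_j, j] = A[miss_j] @ coef
    return filled

# Synthetic example: variable c depends on a and b, and 30% of c is missing.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
c = 2.0 * a - b + rng.normal(scale=0.1, size=n)
complete = np.column_stack([a, b, c])

observed = complete.copy()
mask = rng.random(n) < 0.3
observed[mask, 2] = np.nan

imputed = chained_impute(observed)
mean_fill_err = np.abs(np.nanmean(observed[:, 2]) - complete[mask, 2]).mean()
chained_err = np.abs(imputed[mask, 2] - complete[mask, 2]).mean()
print(f"mean-fill error: {mean_fill_err:.3f}, chained error: {chained_err:.3f}")
```

Because the imputed values borrow variance from the other columns, the filled-in variable tracks the true values much more closely than a constant mean fill, which is the mechanism the conclusions attribute to imputation.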

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3fd9/6580485/48ad61579609/12859_2019_2929_Fig1_HTML.jpg
