Machine learning analysis of TCGA cancer data.

Authors

Liñares-Blanco Jose, Pazos Alejandro, Fernandez-Lozano Carlos

Affiliations

CITIC-Research Center of Information and Communication Technologies, University of A Coruña, A Coruña, Spain.

Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruña, A Coruña, Spain.

Publication information

PeerJ Comput Sci. 2021 Jul 12;7:e584. doi: 10.7717/peerj-cs.584. eCollection 2021.

DOI:10.7717/peerj-cs.584

PMID:34322589

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8293929/

Abstract

In recent years, machine learning (ML) researchers have shifted their focus towards biological problems that are difficult to analyse with standard approaches. Large initiatives such as The Cancer Genome Atlas (TCGA) have made omic data available for training these algorithms. To survey the state of the art, this review covers the main works that have used ML with TCGA data. First, the principal discoveries made by the TCGA consortium are presented. With this foundation established, we turn to the main objective of this study: the identification and discussion of works that have used TCGA data to train different ML approaches. After reviewing more than 100 papers, we classified them according to the following three pillars: the type of tumour, the type of algorithm and the biological problem predicted. One conclusion of this work is that a high density of studies is based on two major algorithms: Random Forest and Support Vector Machines. We also observe a rise in the use of deep artificial neural networks. It is worth emphasizing the increase in integrative models for multi-omic data analysis. The different biological conditions are a consequence of molecular homeostasis, driven by protein-coding regions, regulatory elements and the surrounding environment. It is notable that a large number of works make use of gene expression data, which was found to be the data type researchers preferred when training the different models. The biological problems addressed have been classified into five types: prognosis prediction, tumour subtypes, microsatellite instability (MSI), immunological aspects and certain pathways of interest. A clear trend was detected in the prediction of these conditions according to the type of tumour. Accordingly, a greater number of works have focused on the BRCA cohort, while works on survival, for example, centred on the GBM cohort owing to its large number of events. This review examines in depth the works and methodologies used to study TCGA cancer data. Finally, this work is intended to serve as a basis for future research in this field.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f37/8293929/f24d1de7a824/peerj-cs-07-584-g001.jpg

Similar articles

Machine learning analysis of TCGA cancer data.

PeerJ Comput Sci. 2021 Jul 12;7:e584. doi: 10.7717/peerj-cs.584. eCollection 2021.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

Data-driven modeling and prediction of blood glucose dynamics: Machine learning applications in type 1 diabetes.

Artif Intell Med. 2019 Jul;98:109-134. doi: 10.1016/j.artmed.2019.07.007. Epub 2019 Jul 26.

Artificial neural networks for multi-omics classifications of hepato-pancreato-biliary cancers: towards the clinical application of genetic data.

Eur J Cancer. 2021 May;148:348-358. doi: 10.1016/j.ejca.2021.01.049. Epub 2021 Mar 26.

Spatially aware graph neural networks and cross-level molecular profile prediction in colon cancer histopathology: a retrospective multi-cohort study.

Lancet Digit Health. 2022 Nov;4(11):e787-e795. doi: 10.1016/S2589-7500(22)00168-6.

BRCA-Pathway: a structural integration and visualization system of TCGA breast cancer data on KEGG pathways.

BMC Bioinformatics. 2018 Feb 19;19(Suppl 1):42. doi: 10.1186/s12859-018-2016-6.

Integrative Network Fusion: A Multi-Omics Approach in Molecular Profiling.

Front Oncol. 2020 Jun 30;10:1065. doi: 10.3389/fonc.2020.01065. eCollection 2020.

Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study.

J Med Internet Res. 2020 Aug 10;22(8):e18387. doi: 10.2196/18387.

PreMSIm: An R package for predicting microsatellite instability from the expression profiling of a gene panel in cancer.

Comput Struct Biotechnol J. 2020 Mar 19;18:668-675. doi: 10.1016/j.csbj.2020.03.007. eCollection 2020.

Predicting Deep Learning Based Multi-Omics Parallel Integration Survival Subtypes in Lung Cancer Using Reverse Phase Protein Array Data.

Biomolecules. 2020 Oct 19;10(10):1460. doi: 10.3390/biom10101460.

Cited by

Cancer genomics and bioinformatics in Latin American countries: applications, challenges, and perspectives.

Front Oncol. 2025 Jul 9;15:1584178. doi: 10.3389/fonc.2025.1584178. eCollection 2025.

Neural network prediction model based on Levy flight and natural biomimetic technology for its application in cancer prediction.

PLoS One. 2025 Jun 25;20(6):e0326874. doi: 10.1371/journal.pone.0326874. eCollection 2025.

Modulatory effect of metformin and its transporters on immune infiltration in tumor microenvironment: a bioinformatic study with experimental validation.

Discov Oncol. 2025 May 31;16(1):973. doi: 10.1007/s12672-025-02766-y.

Front Cardiovasc Med. 2025 Mar 31;12:1516043. doi: 10.3389/fcvm.2025.1516043. eCollection 2025.

Exploring the Mechanism of Canmei Formula in Preventing and Treating Recurrence of Colorectal Adenoma Based on Data Mining and Algorithm Prediction.

Biol Proced Online. 2025 Feb 1;27(1):4. doi: 10.1186/s12575-025-00266-5.

A Risk Model Based on Ferroptosis-Related Genes OSMR, G0S2, IGFBP6, IGHG2, and FMOD Predicts Prognosis in Glioblastoma Multiforme.

CNS Neurosci Ther. 2025 Jan;31(1):e70161. doi: 10.1111/cns.70161.

Deep learning to assess microsatellite instability directly from histopathological whole slide images in endometrial cancer.

NPJ Digit Med. 2024 May 29;7(1):143. doi: 10.1038/s41746-024-01131-7.

A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.

BMC Bioinformatics. 2024 May 8;25(1):181. doi: 10.1186/s12859-024-05801-x.

Enhancing cancer stage prediction through hybrid deep neural networks: a comparative study.

Front Big Data. 2024 Mar 22;7:1359703. doi: 10.3389/fdata.2024.1359703. eCollection 2024.

Multilayered insights: a machine learning approach for personalized prognostic assessment in hepatocellular carcinoma.

Front Oncol. 2024 Feb 29;13:1327147. doi: 10.3389/fonc.2023.1327147. eCollection 2023.

