Machine learning analysis of TCGA cancer data.

Author information

Liñares-Blanco Jose, Pazos Alejandro, Fernandez-Lozano Carlos

Affiliations

CITIC-Research Center of Information and Communication Technologies, University of A Coruna, A Coruña, Spain.

Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruna, A Coruña, Spain.

Publication information

PeerJ Comput Sci. 2021 Jul 12;7:e584. doi: 10.7717/peerj-cs.584. eCollection 2021.

Abstract

In recent years, machine learning (ML) researchers have shifted their focus towards biological problems that are difficult to analyse with standard approaches. Large initiatives such as The Cancer Genome Atlas (TCGA) have made omic data available for training these algorithms. To survey the state of the art, this review covers the main works that have used ML with TCGA data. First, the principal discoveries made by the TCGA consortium are presented. With this foundation established, we turn to the main objective of the study: identifying and discussing the works that have used TCGA data to train different ML approaches. After reviewing more than 100 papers, we classified them according to the following three pillars: the type of tumour, the type of algorithm and the biological problem predicted. One conclusion of this work is that a large share of the studies rely on two major algorithms: Random Forest and Support Vector Machines. We also observe a rise in the use of deep artificial neural networks. It is worth emphasizing the increase in integrative models for multi-omic data analysis. The different biological conditions are a consequence of molecular homeostasis, driven by protein-coding regions, regulatory elements and the surrounding environment. Notably, a large number of works use gene expression data, which was found to be the data type researchers prefer when training the different models. The biological problems addressed have been classified into five types: prognosis prediction, tumour subtypes, microsatellite instability (MSI), immunological aspects and certain pathways of interest. A clear trend was detected in the prediction of these conditions according to the type of tumour. For this reason, a greater number of works have focused on the BRCA cohort, while works specific to survival, for example, centred on the GBM cohort because of its large number of events. Throughout this review, readers can explore in depth the works and methodologies used to study TCGA cancer data. Finally, this work is intended to serve as a basis for future research in this field.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f37/8293929/f24d1de7a824/peerj-cs-07-584-g001.jpg
