Zhang Xiaokang, Jonassen Inge, Goksøyr Anders
Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
Center for Cancer Biomarkers, Department of Informatics, University of Bergen, Bergen, Norway
Biomarkers are important in many fields, including cancer research, toxicology, and the diagnosis and treatment of diseases, and they help clarify biological response mechanisms to internal or external interventions. High-throughput gene expression profiling technologies, such as DNA microarrays and RNA sequencing, provide large gene expression datasets that enable data-driven biomarker discovery. Traditional statistical tests have been the mainstream approach for identifying differentially expressed genes as biomarkers. In recent years, machine learning techniques such as feature selection have gained popularity. With many options available, choosing the most appropriate method for a particular dataset becomes essential, and different evaluation metrics have therefore been proposed. Because a method’s performance varies across datasets and across the aspects being evaluated, the idea of integrating multiple methods has emerged. Many integration strategies have been proposed and have shown great potential. This chapter gives an overview of current research advances and open issues in biomarker discovery using machine learning approaches on gene expression data.