Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China.
Center for Intelligent Medicine, Shandong University, Jinan, 250061, Shandong, China.
BMC Med Genomics. 2021 Aug 25;14(Suppl 1):112. doi: 10.1186/s12920-021-00957-4.
Hepatocellular carcinoma (HCC) is one of the most common cancers. Discovering specific genes that serve as biomarkers is of paramount importance for cancer diagnosis and prognosis. The high-throughput omics data generated by The Cancer Genome Atlas (TCGA) consortium provide a valuable resource for discovering HCC biomarker genes. Numerous methods have been proposed to select cancer biomarkers. However, these methods have not investigated how robust the identified biomarkers are across different feature selection techniques.
We use six different recursive feature elimination methods to select gene signatures of HCC from TCGA liver cancer data. The genes shared by all six selected subsets are proposed as robust biomarkers. The Akaike information criterion (AIC) is employed to explain the optimization process of feature selection, which provides a statistical interpretation of feature selection in machine learning methods. We also use several methods to validate the screened biomarkers.
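The selection-and-intersection step can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes scikit-learn, uses synthetic data in place of the TCGA expression matrix, and uses three stand-in classifiers because the six algorithms are not named here. The names `gene_names`, `classifiers`, `selected`, and `robust` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic stand-in for a samples x genes expression matrix (assumption).
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
gene_names = [f"gene_{i}" for i in range(X.shape[1])]  # hypothetical labels

# Stand-in classifiers; each exposes coef_ or feature_importances_,
# which RFECV needs in order to rank features.
classifiers = {
    "logreg": LogisticRegression(max_iter=2000),
    "linear_svm": LinearSVC(dual=False),
    "random_forest": RandomForestClassifier(n_estimators=30, random_state=0),
}

# Run recursive feature elimination with cross-validation per classifier
# and record the set of genes each one keeps.
selected = {}
for name, clf in classifiers.items():
    selector = RFECV(clf, step=1, cv=3, scoring="accuracy").fit(X, y)
    selected[name] = {g for g, keep in zip(gene_names, selector.support_) if keep}

# Genes kept by every classifier are the candidate robust biomarkers.
robust = set.intersection(*selected.values())
print(sorted(robust))
```

Taking the intersection is deliberately conservative: a gene survives only if every classifier's elimination procedure retains it, so the result can be small or even empty on noisy data.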
In this paper, we propose a robust method for discovering HCC biomarker genes from gene expression data. Specifically, we implement recursive feature elimination with cross-validation (RFE-CV) based on six different classification algorithms. The overlap among the gene sets discovered by the different methods is taken as the set of identified biomarkers. We interpret the machine-learning feature selection process using the AIC from statistics. Furthermore, the features selected by backward stepwise logistic regression under the minimum-AIC criterion are completely contained in the identified biomarkers. The classification results verify the superiority of this interpretable, robust biomarker discovery method.
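For reference, the AIC of a model with $k$ estimated parameters and maximized likelihood $\hat{L}$ is $\mathrm{AIC} = 2k - 2\ln\hat{L}$, and backward stepwise selection repeatedly drops the feature whose removal lowers the AIC most. The sketch below illustrates that idea under stated assumptions: it uses scikit-learn on synthetic data, and it approximates an unpenalized logistic fit with a very weak L2 penalty (`C=1e6`). The helper names `logistic_aic` and `backward_stepwise_aic` are hypothetical and are not taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss


def logistic_aic(X, y, cols):
    """AIC = 2k - 2 ln L for a logistic model on the given feature columns."""
    # C=1e6 makes the L2 penalty negligible (approximate maximum likelihood).
    model = LogisticRegression(C=1e6, max_iter=5000).fit(X[:, cols], y)
    # Summed negative log-likelihood of the fitted model on the training data.
    nll = log_loss(y, model.predict_proba(X[:, cols]), normalize=False)
    k = len(cols) + 1  # coefficients plus intercept
    return 2 * k + 2 * nll


def backward_stepwise_aic(X, y):
    """Drop one feature at a time while doing so lowers the AIC."""
    cols = list(range(X.shape[1]))
    best = logistic_aic(X, y, cols)
    while len(cols) > 1:
        # AIC of every model obtained by removing a single feature.
        trials = [(logistic_aic(X, y, [c for c in cols if c != d]), d)
                  for d in cols]
        trial_aic, drop = min(trials)
        if trial_aic >= best:
            break  # no single removal improves the AIC
        best = trial_aic
        cols.remove(drop)
    return cols, best


# Synthetic data with label noise so the classes are not perfectly separable.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)
full_aic = logistic_aic(X, y, list(range(X.shape[1])))
kept, final_aic = backward_stepwise_aic(X, y)
print(kept, round(full_aic, 2), round(final_aic, 2))
```

The columns in `kept` could then be compared with the intersection obtained from the RFE-CV runs to check whether the AIC-selected features fall inside it.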
We find that the gene subsets selected by RFE-CV with the six classifiers contain different numbers of features but overlap with one another. The AIC values in model selection provide a theoretical foundation for the feature selection process of biomarker discovery via machine learning. Moreover, genes contained in more of the optimally selected subsets have clearer biological meaning and implications. Intersecting the biomarkers selected by different classifiers improves the quality of feature selection. This is a general method suitable for screening biomarkers of complex diseases from high-throughput data.