Robust biomarker discovery for hepatocellular carcinoma from high-throughput data by multiple feature selection methods.

Author information

Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China.

Center for Intelligent Medicine, Shandong University, Jinan, 250061, Shandong, China.

Publication information

BMC Med Genomics. 2021 Aug 25;14(Suppl 1):112. doi: 10.1186/s12920-021-00957-4.

Abstract

BACKGROUND

Hepatocellular carcinoma (HCC) is one of the most common cancers. The discovery of specific genes serving as biomarkers is of paramount significance for cancer diagnosis and prognosis. The high-throughput omics data generated by The Cancer Genome Atlas (TCGA) consortium provide a valuable resource for the discovery of HCC biomarker genes. Numerous methods have been proposed to select cancer biomarkers. However, these methods have not investigated how robust the identified biomarkers are across different feature selection techniques.

METHODS

We use six different recursive feature elimination methods to select gene signatures of HCC from TCGA liver cancer data. The genes shared by all six selected subsets are proposed as robust biomarkers. The Akaike information criterion (AIC) is employed to explain the optimization process of feature selection, which provides a statistical interpretation of feature selection in machine learning methods. We also use several methods to validate the screened biomarkers.
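
The selection-and-intersection idea described above can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic rather than TCGA, only three classifiers are used instead of six, and the particular classifiers, the gene names, and the cross-validation settings are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic stand-in for an expression matrix (samples x genes); NOT TCGA data.
X, y = make_classification(
    n_samples=120, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
gene_names = np.array([f"gene_{i}" for i in range(X.shape[1])])  # hypothetical names

# Illustrative classifiers only (assumption): the abstract does not name its six.
# RFECV needs estimators that expose coef_ or feature_importances_.
estimators = {
    "logreg": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(dual=False),
    "random_forest": RandomForestClassifier(n_estimators=25, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def select_genes(estimator):
    """Run recursive feature elimination with cross-validation and
    return the names of the features kept at the best CV score."""
    selector = RFECV(estimator, step=1, cv=cv, scoring="accuracy")
    selector.fit(X, y)
    return set(gene_names[selector.support_])


# One selected subset per classifier.
subsets = {name: select_genes(est) for name, est in estimators.items()}

# Candidate "robust" features: those kept by every classifier.
robust = set.intersection(*subsets.values())

for name, genes in subsets.items():
    print(f"{name}: {len(genes)} features selected")
print("shared by all classifiers:", sorted(robust))
```

The intersection may be small or even empty on real data; requiring agreement across every classifier is a deliberately strict criterion, trading sensitivity for stability of the selected set.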

RESULTS

In this paper, we propose a robust method for discovering biomarker genes for HCC from gene expression data. Specifically, we implement recursive feature elimination with cross-validation (RFE-CV) based on six different classification algorithms. The genes that overlap among the sets discovered by the different methods are referred to as the identified biomarkers. We interpret the machine learning-based feature selection process using the AIC from statistics. Furthermore, the features selected by backward stepwise logistic regression under AIC minimization are completely contained in the identified biomarkers. The classification results verify the superiority of the interpretable, robust biomarker discovery method.
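
For reference, the abstract does not restate the criterion it relies on. The standard definition of the AIC, with $k$ the number of estimated parameters and $\hat{L}$ the maximized likelihood of the fitted model, is:

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}
$$

In a backward stepwise procedure, dropping one feature reduces $k$ by one, so the reduced model has the lower AIC exactly when the log-likelihood it gives up is less than 1:

$$
\mathrm{AIC}_{\text{reduced}} < \mathrm{AIC}_{\text{full}}
\iff
\ln\hat{L}_{\text{full}} - \ln\hat{L}_{\text{reduced}} < 1
$$

This is the general textbook form; how the authors apply it to their logistic models is not detailed in the abstract.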

CONCLUSIONS

The gene subsets selected by RFE-CV with the 6 classifiers contain different numbers of features and overlap with one another. The AIC values in model selection provide a theoretical foundation for the feature selection process of biomarker discovery via machine learning. Moreover, genes contained in more of the optimally selected subsets make better biological sense. The intersections of biomarkers selected by different classifiers improve the quality of feature selection. This is a general method suitable for screening biomarkers of complex diseases from high-throughput data.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00d4/8386074/203ae08d488a/12920_2021_957_Fig1_HTML.jpg
