Key Laboratory of High Confidence Software Technologies (MOE), School of CS, Peking University, Beijing, China.
Institute of Computational Social Science, Peking University (Qingdao), Qingdao, China.
Bioinformatics. 2022 Jun 27;38(13):3415-3421. doi: 10.1093/bioinformatics/btac334.
The emergence of next-generation sequencing techniques opens up tremendous opportunities for researchers to uncover the basic mechanisms of disease at the molecular level. Recently, automatic machine learning (AutoML) frameworks have been employed for genomic and epigenomic data analysis. However, when applied to such high-dimensional data, existing AutoML frameworks suffer from two issues: (i) they cannot effectively filter out redundant features from the original data, and (ii) they usually build the machine learning pipeline by performing feature engineering first and tuning algorithm hyper-parameters afterwards, which can lead to sub-optimal outcomes. Thus, there is an urgent need for a new AutoML framework designed for high-dimensional omics data analysis.
We introduce AutoDC, a tailored AutoML framework for classifying different diseases based on gene expression data. AutoDC uses two novel optimization strategies to improve performance. First, a two-stage feature selection method selects the features with high gene contribution scores. Second, an optimization method based on a two-layer Multi-Armed Bandit framework jointly optimizes feature engineering, algorithm selection and algorithm hyper-parameter tuning. We apply the framework to two public gene expression datasets. Compared with three state-of-the-art AutoML frameworks, AutoDC classifies diseases with higher predictive accuracy.
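To make the idea of a two-layer Multi-Armed Bandit more concrete, the following is a minimal sketch, not AutoDC's actual implementation. The abstract does not state which bandit policy is used or how the two layers are organized, so everything here is an assumption for illustration: the UCB1 policy, the layering (an upper bandit that picks an algorithm, and one lower bandit per algorithm that picks which sub-task to spend the next trial on), and all names (`UCB1`, `two_layer_search`, `toy_evaluate`, the arm labels and the toy scores).

```python
import math


class UCB1:
    """Generic UCB1 bandit over a fixed list of arm names (assumed policy)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}

    def select(self):
        # Play every arm once before relying on the confidence bound.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        total = sum(self.counts.values())
        return max(
            self.arms,
            key=lambda a: self.means[a]
            + math.sqrt(2.0 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental update of the running mean reward of the arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def two_layer_search(evaluate, algorithms, sub_tasks, budget):
    """Hypothetical two-layer loop: the upper bandit chooses an algorithm,
    the lower bandit of that algorithm chooses a sub-task to try next."""
    upper = UCB1(algorithms)
    lower = {alg: UCB1(sub_tasks) for alg in algorithms}
    best = (None, None, float("-inf"))
    for _ in range(budget):
        alg = upper.select()
        task = lower[alg].select()
        score = evaluate(alg, task)  # e.g. validation accuracy in [0, 1]
        lower[alg].update(task, score)
        upper.update(alg, score)
        if score > best[2]:
            best = (alg, task, score)
    return best, upper


# Toy, deterministic stand-in for training and validating a pipeline.
# These numbers are made up and are not results from the paper.
BASE = {"random_forest": 0.8, "svm": 0.5, "logistic_regression": 0.3}
BONUS = {"feature_engineering": 0.1, "hyperparameter_tuning": 0.05}


def toy_evaluate(alg, task):
    return BASE[alg] + BONUS[task]


best, upper = two_layer_search(toy_evaluate, list(BASE), list(BONUS), budget=200)
print(best, upper.counts)
```

In this sketch both layers share the same reward signal, so trials drift towards the algorithm and sub-task that have paid off so far, rather than finishing feature engineering before any hyper-parameter tuning begins. A real system would replace `toy_evaluate` with actual model training and validation.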
The data and code of AutoDC are available at https://github.com/dingdian110/AutoDC. The data underlying this article are available in the article and in its online supplementary material.
Supplementary data are available at Bioinformatics online.