Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses.

Author affiliations

Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA.

Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA.

Publication information

BMC Bioinformatics. 2020 Oct 1;21(1):430. doi: 10.1186/s12859-020-03755-4.

BACKGROUND

A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis.

RESULTS

We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids 'leakage' during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj .
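The idea of regressing out covariates without leakage can be illustrated with a minimal sketch. This is not the TPOT extension itself: it assumes a plain linear least-squares adjustment of the features only, and the function names (`fit_covariate_model`, `residualize`, `adjusted_folds`) and the synthetic data are hypothetical. The point it shows is that the covariate model is estimated on the training fold alone and then applied unchanged to the held-out fold.

```python
import numpy as np


def fit_covariate_model(C_train, X_train):
    """Least-squares fit of every feature column on the covariates plus an intercept."""
    design = np.column_stack([np.ones(len(C_train)), C_train])
    coef, *_ = np.linalg.lstsq(design, X_train, rcond=None)
    return coef


def residualize(C, X, coef):
    """Subtract the covariate-predicted part of X using already-fitted coefficients."""
    design = np.column_stack([np.ones(len(C)), C])
    return X - design @ coef


def kfold_indices(n, k, seed=0):
    """Yield (train, test) index arrays for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]


def adjusted_folds(X, C, k=5, seed=0):
    """Yield per-fold adjusted features; held-out rows never enter the fit."""
    for train, test in kfold_indices(len(X), k, seed):
        coef = fit_covariate_model(C[train], X[train])  # training rows only
        yield (train, test,
               residualize(C[train], X[train], coef),
               residualize(C[test], X[test], coef))


# Synthetic illustration (made-up data): 3 features driven by 2 covariates.
rng = np.random.default_rng(42)
n = 200
C = rng.normal(size=(n, 2))
effects = np.array([[1.0, 0.5, 0.0],
                    [0.0, -1.0, 2.0]])
X = C @ effects + rng.normal(scale=0.1, size=(n, 3))

for train, test, X_train_adj, X_test_adj in adjusted_folds(X, C, k=5):
    # A downstream model would be trained on X_train_adj and scored on X_test_adj.
    pass
```

Fitting the adjustment on the full data set before splitting would let information from the held-out samples influence the training features, which is the kind of leakage the approach is designed to avoid. A target affected by the same covariates could be handled analogously under the same assumptions.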

CONCLUSIONS

In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.
