Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
Department of Cardiology, Division Heart and Lungs, Utrecht, The Netherlands.
Bioinformatics. 2020 Mar 1;36(6):1772-1778. doi: 10.1093/bioinformatics/btz796.
Selecting the optimal machine learning (ML) model for a given dataset is often challenging. Automated ML (AutoML) has emerged as a powerful tool for automatically selecting ML methods and parameter settings for the prediction of biomedical endpoints. Here, we apply the tree-based pipeline optimization tool (TPOT) to predict angiographic diagnoses of coronary artery disease (CAD). With TPOT, ML models are represented as expression trees, and optimal pipelines are discovered using a stochastic search method called genetic programming. We provide some guidelines for TPOT-based ML pipeline selection and optimization based on various clinical phenotypes and high-throughput metabolic profiles in the Angiography and Genes Study (ANGES).
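To make the idea of searching over expression trees concrete, the following is a minimal sketch of a mutation-only evolutionary search in plain Python. It is not TPOT's implementation: the operator names, the functions `random_pipeline`, `mutate`, `evolve`, and the `toy_fitness` stand-in for a cross-validated score are all hypothetical, and full genetic programming typically also uses crossover between trees.

```python
import random

# Hypothetical operator names for illustration; not TPOT's operator set.
PREPROCESSORS = ["scale", "select_features", "pca"]
CLASSIFIERS = ["logistic_regression", "random_forest", "naive_bayes"]


def random_pipeline(rng, max_preprocessors=2):
    """Build a random expression tree: classifier(preprocessor(...(input)))."""
    tree = "input"
    for _ in range(rng.randint(0, max_preprocessors)):
        tree = (rng.choice(PREPROCESSORS), tree)
    return (rng.choice(CLASSIFIERS), tree)


def mutate(tree, rng):
    """Return a new tree that differs from `tree` by at most one change."""
    classifier, child = tree
    choice = rng.choice(["swap_classifier", "add_step", "drop_step"])
    if choice == "swap_classifier":
        return (rng.choice(CLASSIFIERS), child)
    if choice == "add_step":
        return (classifier, (rng.choice(PREPROCESSORS), child))
    # drop_step: remove the outermost preprocessor if there is one
    if child != "input":
        return (classifier, child[1])
    return tree


def evolve(fitness, generations=20, population_size=10, seed=0):
    """Keep the fitter half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: population_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)


def toy_fitness(tree):
    """Stand-in for a cross-validated score (an assumption for this sketch):
    rewards one classifier and penalizes each preprocessing step."""
    classifier, child = tree
    depth = 0
    while child != "input":
        depth += 1
        child = child[1]
    return (1.0 if classifier == "random_forest" else 0.0) - 0.1 * depth


best = evolve(toy_fitness)
print(best)
```

In a real setting the fitness function would fit the pipeline encoded by each tree and return a held-out performance score, which is what makes the search expensive.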
We analyzed nuclear magnetic resonance-derived lipoprotein and metabolite profiles in the ANGES cohort with the goal of identifying the role of non-obstructive CAD patients in CAD diagnostics. We performed a comparative analysis of TPOT-generated ML pipelines with selected ML classifiers, optimized with a grid search approach, applied to two phenotypic CAD profiles. TPOT generated ML pipelines that outperformed grid search-optimized models across multiple performance metrics, including balanced accuracy and area under the precision-recall curve. With the selected models, we demonstrated that the phenotypic profile that distinguishes non-obstructive CAD patients from no-CAD patients is associated with higher precision, suggesting a discrepancy in the underlying processes between these phenotypes.
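For readers unfamiliar with the two metrics named above, the sketch below computes them for binary labels in plain Python. The labels and scores are made-up illustrative values, not ANGES data, and average precision is used here as one common step-wise estimate of the area under the precision-recall curve; the abstract does not state which estimator the authors used.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def average_precision(y_true, scores):
    """Average of the precision values at each rank where a case is found.

    Assumes at least one positive label and no tied scores.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Made-up example: 3 cases and 5 controls, with predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

bal_acc = balanced_accuracy(y_true, y_pred)
avg_prec = average_precision(y_true, scores)
print(round(bal_acc, 4), round(avg_prec, 4))  # 0.7333 0.8056
```

Both metrics are less sensitive to class imbalance than plain accuracy, which is one reason they are commonly reported for diagnostic classifiers.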
TPOT is freely available via http://epistasislab.github.io/tpot/.
Supplementary data are available at Bioinformatics online.