Zeng Xueqiang, Luo Gang
Computer Center, Nanchang University, 999 Xuefu Road, Nanchang, 330031 Jiangxi People's Republic of China.
Department of Biomedical Informatics and Medical Education, University of Washington, UW Medicine South Lake Union, 850 Republican Street, Building C, Box 358047, Seattle, WA 98109 USA.
Health Inf Sci Syst. 2017 Sep 27;5(1):2. doi: 10.1007/s13755-017-0023-z. eCollection 2017 Dec.
Machine learning is broadly used for clinical data analysis. Before training a model, a machine learning algorithm must be selected, and the values of one or more model parameters, termed hyper-parameters, must be set. Selecting algorithms and hyper-parameter values requires advanced machine learning knowledge and many labor-intensive manual iterations. To lower the barrier to using machine learning, various automatic selection methods for algorithms and/or hyper-parameter values have been proposed. Existing automatic selection methods are inefficient on large data sets. This poses a challenge for using machine learning in the clinical big data era.
To address the challenge, this paper presents progressive sampling-based Bayesian optimization, an efficient and automatic selection method for both algorithms and hyper-parameter values.
We report an implementation of the method. We show that, compared to a state-of-the-art automatic selection method, our method can significantly reduce search time, classification error rate, and the standard deviation of error rate due to randomization.
This is major progress towards enabling fast turnaround in identifying high-quality solutions required by many machine learning-based clinical data analysis tasks.
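To make the idea of progressive sampling over algorithms and hyper-parameter values concrete, here is a minimal sketch. It is not the authors' implementation: it replaces Bayesian optimization with a fixed candidate list and a simple keep-the-best-half rule, and every name in it (`make_data`, `threshold_rule`, `knn_rule`, `progressive_search`, the synthetic data set) is a hypothetical chosen for illustration. The only point it shows is that candidates can be screened cheaply on small training samples, with the sample growing as the candidate set shrinks.

```python
import random

def make_data(n, rng):
    """Synthetic 1-D binary data: label is 1 when x > 0.6, with 10% label noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.6)
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def threshold_rule(train, t):
    """Toy algorithm A: predict 1 when x > t (t is the hyper-parameter)."""
    return lambda x: int(x > t)

def knn_rule(train, k):
    """Toy algorithm B: k-nearest-neighbour majority vote (k is the hyper-parameter)."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return int(2 * sum(y for _, y in nearest) > len(nearest))
    return predict

def error_rate(predict, validation):
    return sum(predict(x) != y for x, y in validation) / len(validation)

def progressive_search(train, validation, candidates, start_size=50, rng=None):
    """Screen (name, builder, hyper-parameter) candidates on growing samples.

    Assumption: halving the candidate set each round stands in for the
    model-guided search a Bayesian optimizer would perform.
    """
    rng = rng or random.Random(0)
    survivors = list(candidates)
    size = min(start_size, len(train))
    while len(survivors) > 1 and size < len(train):
        sample = rng.sample(train, size)
        ranked = sorted(
            survivors,
            key=lambda c: error_rate(c[1](sample, c[2]), validation),
        )
        survivors = ranked[: max(1, len(ranked) // 2)]  # keep the better half
        size = min(2 * size, len(train))                # grow the sample
    # Score the remaining candidates on the full training set.
    final = [(error_rate(b(train, hp), validation), name, hp)
             for name, b, hp in survivors]
    err, name, hp = min(final, key=lambda s: s[0])
    return name, hp, err

rng = random.Random(0)
train, validation = make_data(800, rng), make_data(200, rng)
candidates = [("threshold", threshold_rule, t) for t in (0.2, 0.4, 0.6, 0.8)]
candidates += [("knn", knn_rule, k) for k in (1, 5, 15, 25)]
name, hp, err = progressive_search(train, validation, candidates, rng=rng)
print(name, hp, err)
```

In this sketch most candidates are only ever trained on 50 or 100 points, and the full training set is used once, for the final survivor. A real system would also need a principled way to propose new hyper-parameter values rather than a fixed grid.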