Dligach Dmitriy, Miller Timothy, Savova Guergana K
Boston Children's Hospital and Harvard Medical School, Boston, MA.
AMIA Annu Symp Proc. 2015 Nov 5;2015:502-11. eCollection 2015.
Supervised learning is the dominant approach to automatic phenotyping from electronic health records, but it is expensive because labeling the training data requires manual chart review. Semi-supervised learning takes advantage of both scarce labeled data and plentiful unlabeled data. In this work, we study a family of semi-supervised learning algorithms based on Expectation Maximization (EM) in the context of several phenotyping tasks. We first experiment with the basic EM algorithm. When the modeling assumptions are violated, basic EM leads to inaccurate parameter estimation. Augmented EM attenuates this shortcoming by introducing a weighting factor that downweights the unlabeled data. Cross-validation does not always find the best setting of the weighting factor, and other heuristic methods may be preferable. We show that accurate phenotyping models can be trained with only a few hundred labeled examples (and a large number of unlabeled ones), potentially providing substantial savings in the amount of manual chart review required.
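The weighted EM idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the underlying classifier, so this example assumes a multinomial Naive Bayes model over count features; the function names (`fit_weighted_em`, `predict_proba`), the parameter `lam` for the weighting factor, and the smoothing constant `alpha` are all hypothetical choices made for illustration, not the authors' implementation. Setting `lam=1` corresponds to basic EM, and `lam=0` reduces to purely supervised training.

```python
import numpy as np

def predict_proba(X, log_prior, log_lik):
    """Class posteriors under a multinomial Naive Bayes model (assumed model)."""
    log_joint = X @ log_lik.T + log_prior
    # Subtract the row maximum before exponentiating, for numerical stability.
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def fit_weighted_em(X_lab, y_lab, X_unl, n_classes=2, lam=0.1,
                    n_iter=20, alpha=1.0):
    """Semi-supervised Naive Bayes with EM; unlabeled counts are scaled by lam.

    lam (hypothetical name) is the weighting factor: 1.0 gives basic EM,
    values below 1.0 downweight the unlabeled data, 0.0 ignores it.
    """
    n_features = X_lab.shape[1]
    # Labeled examples keep fixed one-hot responsibilities.
    R_lab = np.eye(n_classes)[np.asarray(y_lab)]
    # Unlabeled responsibilities start at zero, so the first M-step
    # yields the purely supervised model.
    R_unl = np.zeros((X_unl.shape[0], n_classes))

    for _ in range(n_iter + 1):
        # M-step: expected counts, with unlabeled contributions scaled by lam.
        class_counts = R_lab.sum(axis=0) + lam * R_unl.sum(axis=0)
        word_counts = R_lab.T @ X_lab + lam * (R_unl.T @ X_unl)
        log_prior = (np.log(class_counts + alpha)
                     - np.log(class_counts.sum() + alpha * n_classes))
        log_lik = (np.log(word_counts + alpha)
                   - np.log(word_counts.sum(axis=1, keepdims=True)
                            + alpha * n_features))
        # E-step: re-estimate soft labels for the unlabeled examples.
        R_unl = predict_proba(X_unl, log_prior, log_lik)

    return log_prior, log_lik
```

In this sketch, `lam` would be chosen either by cross-validation on the labeled set or by a heuristic, which is the trade-off the abstract refers to.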