Department of Biomedical Informatics, Vanderbilt University, School of Medicine, Nashville, Tennessee, USA.
J Am Med Inform Assoc. 2013 Dec;20(e2):e253-9. doi: 10.1136/amiajnl-2013-001945. Epub 2013 Jul 13.
Generalizable, high-throughput phenotyping methods based on supervised machine learning (ML) algorithms could significantly accelerate the use of electronic health records data for clinical and translational research. However, they often require large numbers of annotated samples, which are costly and time-consuming to review. We investigated the use of active learning (AL) in ML-based phenotyping algorithms.
We integrated an uncertainty sampling AL approach with support vector machine-based phenotyping algorithms and evaluated its performance using three annotated disease cohorts: rheumatoid arthritis (RA), colorectal cancer (CRC), and venous thromboembolism (VTE). We investigated performance using two types of feature sets: unrefined features, which included, at a minimum, all clinical concepts extracted from notes and billing codes; and a smaller set of refined features selected by domain experts. The performance of AL was compared with that of a passive learning (PL) approach based on random sampling.
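The contrast between uncertainty sampling and random sampling described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names, the batch size, and the `fit_and_score` callback are all hypothetical. The callback stands in for retraining a margin-based classifier such as an SVM and returning a signed distance to the decision boundary for every sample in the pool.

```python
import random
from typing import Callable, Iterable, List, Sequence, Set


def select_uncertain(margins: Sequence[float], labeled: Set[int],
                     batch_size: int) -> List[int]:
    """Uncertainty sampling: pick the unlabeled samples whose signed
    distance to the decision boundary is closest to zero."""
    candidates = [i for i in range(len(margins)) if i not in labeled]
    candidates.sort(key=lambda i: abs(margins[i]))
    return candidates[:batch_size]


def select_random(n_samples: int, labeled: Set[int], batch_size: int,
                  rng: random.Random) -> List[int]:
    """Passive learning baseline: pick unlabeled samples uniformly at random."""
    candidates = [i for i in range(n_samples) if i not in labeled]
    return rng.sample(candidates, min(batch_size, len(candidates)))


def active_learning_loop(seed_indices: Iterable[int],
                         fit_and_score: Callable[[List[int]], Sequence[float]],
                         batch_size: int, n_rounds: int) -> Set[int]:
    """Grow the annotated set round by round.

    fit_and_score (hypothetical) retrains the classifier on the currently
    labeled indices and returns one margin per sample in the pool.
    """
    labeled = set(seed_indices)
    for _ in range(n_rounds):
        margins = fit_and_score(sorted(labeled))
        picked = select_uncertain(margins, labeled, batch_size)
        if not picked:  # pool exhausted
            break
        labeled.update(picked)  # in practice, a reviewer annotates these
    return labeled
```

In this sketch the only difference between AL and PL is which selection function chooses the next batch to annotate; the classifier and the annotation step are the same in both.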
Our evaluation showed that AL outperformed PL on all three phenotyping tasks. When unrefined features were used in the RA and CRC tasks, AL reduced the number of annotated samples required to achieve an area under the curve (AUC) score of 0.95 by 68% and 23%, respectively. With refined features, AL also reduced the number of annotated samples required for VTE by 68%, at an optimal AUC of 0.70. As expected, refined features improved the performance of phenotyping classifiers and required fewer annotated samples.
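A figure such as "68% fewer annotated samples to reach an AUC of 0.95" can be read as a comparison of two learning curves. The sketch below shows one way to compute it, assuming each curve is a list of (number of annotated samples, AUC) pairs. The helper names are hypothetical and no values from the study are reproduced.

```python
from typing import List, Optional, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve as the probability that a random positive
    is scored above a random negative (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def samples_to_reach(curve: List[Tuple[int, float]],
                     target_auc: float) -> Optional[int]:
    """Smallest annotated-sample count at which the curve meets the target,
    or None if the target is never reached."""
    for n_annotated, score in sorted(curve):
        if score >= target_auc:
            return n_annotated
    return None


def annotation_reduction(n_active: int, n_passive: int) -> float:
    """Fraction of annotations saved by AL relative to PL."""
    return 1.0 - n_active / n_passive
```

For example, if a passive learner needed 100 annotated samples to reach the target and an active learner needed 32 (made-up numbers), `annotation_reduction(32, 100)` would give a reduction of about 0.68.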
This study demonstrated that AL can be useful in ML-based phenotyping methods. Moreover, AL and feature engineering based on domain knowledge could be combined to develop efficient and generalizable phenotyping methods.