Cost-sensitive Active Learning for Phenotyping of Electronic Health Records.

Authors

Ji Zongcheng, Wei Qiang, Franklin Amy, Cohen Trevor, Xu Hua

Affiliations

School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.

Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA.

Publication information

AMIA Jt Summits Transl Sci Proc. 2019 May 6;2019:829-838. eCollection 2019.

Abstract

Developing high-throughput, high-performance phenotyping algorithms is critical to the secondary use of electronic health records for clinical research. Supervised machine learning methods have shown good performance, but they often require large annotated datasets that are costly to build. Simulation studies have shown that active learning (AL) can reduce the number of annotated samples while improving model performance, under the assumption that every sample takes the same time to label (i.e., the cost-insensitive setting). In this study, we proposed a cost-sensitive AL (CostAL) algorithm for clinical phenotyping, using the identification of breast cancer patients as a use case. CostAL uses a linear regression model to estimate the actual time required to annotate each individual sample. We recruited two annotators to manually review the medical records of 766 potential breast cancer patients and recorded the actual time spent annotating each sample. We then compared CostAL, AL, and passive learning (PL, i.e., random sampling) on this annotated dataset and generated learning curves for each method. Our experimental results showed that CostAL achieved the highest area under the curve (AUC) score among the three algorithms: the scores for PL, AL, and CostAL were 0.784, 0.8501, and 0.8673 for user 1 and 0.8006, 0.8806, and 0.9006 for user 2. To reach an accuracy of 0.94, AL and CostAL could save 36% and 60% of annotation time for user 1 and 53% and 70% of annotation time for user 2, compared with PL, indicating the value of cost-sensitive AL approaches.
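The core idea of cost-sensitive selection can be illustrated with a minimal sketch. The abstract states only that CostAL estimates annotation time with a linear regression model; everything else below is an assumption made for illustration: the single predictor (word count of the record), the entropy-based uncertainty measure, the rule of ranking samples by uncertainty divided by predicted time, and all data values and function names. It is not the authors' implementation.

```python
import math


def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def binary_entropy(p):
    """Uncertainty of a binary prediction with positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_next(pool, intercept, slope, min_cost=1.0):
    """Pick the sample with the highest uncertainty per predicted second.

    pool: list of (sample_id, predicted_probability, word_count) tuples.
    min_cost guards against zero or negative predicted times.
    """
    def score(item):
        _, prob, words = item
        predicted_seconds = max(intercept + slope * words, min_cost)
        return binary_entropy(prob) / predicted_seconds

    return max(pool, key=score)[0]


# Hypothetical annotation log: record length (words) vs. time spent (seconds).
lengths = [100, 200, 300, 400]
seconds = [30.0, 50.0, 70.0, 90.0]
intercept, slope = fit_simple_linear_regression(lengths, seconds)

# Hypothetical unlabeled pool with classifier probabilities and lengths.
pool = [
    ("long_uncertain", 0.50, 2000),
    ("short_uncertain", 0.40, 150),
    ("short_confident", 0.98, 100),
]

cost_sensitive_choice = select_next(pool, intercept, slope)
uncertainty_only_choice = max(pool, key=lambda item: binary_entropy(item[1]))[0]
print(cost_sensitive_choice, uncertainty_only_choice)
```

In this toy pool, plain uncertainty sampling picks the long record because its prediction is the most uncertain, whereas the cost-weighted rule prefers the short record that is nearly as uncertain but much cheaper to review. That trade-off is the intuition behind charging each query its estimated annotation time rather than treating all samples as equally costly.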
