Ji Zongcheng, Wei Qiang, Franklin Amy, Cohen Trevor, Xu Hua
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA.
AMIA Jt Summits Transl Sci Proc. 2019 May 6;2019:829-838. eCollection 2019.
Developing high-throughput and high-performance phenotyping algorithms is critical to the secondary use of electronic health records for clinical research. Supervised machine learning-based methods have shown good performance, but often require large annotated datasets that are costly to build. Simulation studies have shown that active learning (AL) can reduce the number of annotated samples while improving model performance, under the assumption that every sample takes the same time to label (i.e., cost-insensitive). In this study, we proposed a cost-sensitive AL (CostAL) algorithm for clinical phenotyping, using the identification of breast cancer patients as a use case. CostAL implements a linear regression model to estimate the actual time required to annotate each individual sample. We recruited two annotators to manually review the medical records of 766 potential breast cancer patients and recorded the actual time spent annotating each sample. We then compared CostAL, AL, and passive learning (PL, i.e., random sampling) on this annotated dataset and generated learning curves for each method. Our experimental results showed that CostAL achieved the highest area under the curve (AUC) score among the three algorithms (0.784, 0.8501, and 0.8673 for PL, AL, and CostAL, respectively, for user 1, and 0.8006, 0.8806, and 0.9006 for user 2). To reach an accuracy of 0.94, AL and CostAL saved 36% and 60% of the annotation time for user 1, and 53% and 70% for user 2, compared with PL, indicating the value of cost-sensitive AL approaches.
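The cost-sensitive selection idea described in the abstract can be illustrated with a minimal sketch. The abstract states only that CostAL uses a linear regression model to estimate per-sample annotation time; everything else here is an assumption made for illustration: the single predictor (document length), the uncertainty measure, the uncertainty-per-second ranking score, the function names, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def fit_simple_linear_regression(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[float, float]:
    """Ordinary least squares fit of y ~ slope * x + intercept."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def uncertainty(prob_positive: float) -> float:
    """Binary uncertainty: 1.0 at p = 0.5, falling to 0.0 at p = 0 or p = 1."""
    return 1.0 - 2.0 * abs(prob_positive - 0.5)


def select_cost_sensitive(
    probs: Sequence[float],
    lengths: Sequence[float],
    slope: float,
    intercept: float,
    min_cost: float = 1.0,
) -> int:
    """Return the index of the pool sample with the highest
    uncertainty per predicted second of annotation time.

    ASSUMPTION: ranking by uncertainty / predicted cost is one common
    cost-sensitive heuristic; the abstract does not specify CostAL's rule.
    """
    best_idx, best_score = -1, float("-inf")
    for i, (p, length) in enumerate(zip(probs, lengths)):
        # Clamp the predicted time so the ratio stays finite and positive.
        predicted_cost = max(slope * length + intercept, min_cost)
        score = uncertainty(p) / predicted_cost
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


# Hypothetical already-annotated samples: document length vs. seconds spent.
labeled_lengths = [100.0, 200.0, 300.0, 400.0]
labeled_seconds = [12.0, 22.0, 32.0, 42.0]
slope, intercept = fit_simple_linear_regression(labeled_lengths, labeled_seconds)

# Hypothetical unlabeled pool: classifier probabilities and document lengths.
pool_probs = [0.50, 0.55, 0.95]
pool_lengths = [2000.0, 200.0, 100.0]

cost_insensitive_pick = max(
    range(len(pool_probs)), key=lambda i: uncertainty(pool_probs[i])
)
cost_sensitive_pick = select_cost_sensitive(
    pool_probs, pool_lengths, slope, intercept
)
print("cost-insensitive pick:", cost_insensitive_pick)
print("cost-sensitive pick:", cost_sensitive_pick)
```

In this toy pool, plain uncertainty sampling prefers the most ambiguous but very long record, while the cost-sensitive score prefers a nearly as ambiguous record that is predicted to be much quicker to annotate. That trade-off is the intuition behind comparing methods by annotation time rather than by sample count.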