Departamento de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Concepción, Chile.
J Am Med Inform Assoc. 2012 Sep-Oct;19(5):809-16. doi: 10.1136/amiajnl-2011-000648. Epub 2012 Jun 15.
This study explores whether active learning algorithms can reduce the need for large training sets in medical text classification tasks.
Three existing active learning algorithms (distance-based (DIST), diversity-based (DIV), and a combination of both (CMB)) were used to classify text from five datasets. The performance of these algorithms was compared to that of passive learning on the five datasets. We then conducted a novel investigation of the interaction between dataset characteristics and the performance results.
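The three query strategies above can be illustrated with a minimal sketch. This is not the paper's implementation: the scoring rules below follow the usual textbook forms of these strategies (DIST ranks unlabeled examples by closeness to a margin-based classifier's decision boundary, DIV ranks them by distance from the already-labeled pool, CMB mixes the two), and the mixing weight `alpha` is a hypothetical parameter, not taken from the study.

```python
import numpy as np

def dist_scores(decision_values):
    """DIST-style score: a smaller |decision value| means the example
    lies closer to the decision boundary, so it ranks higher.
    (Assumption: a margin-based classifier such as an SVM.)"""
    return -np.abs(decision_values)

def div_scores(unlabeled, labeled):
    """DIV-style score: distance to the nearest labeled example,
    so novel, far-away examples rank higher."""
    # Pairwise Euclidean distances, shape (n_unlabeled, n_labeled).
    d = np.linalg.norm(unlabeled[:, None, :] - labeled[None, :, :], axis=2)
    return d.min(axis=1)

def cmb_scores(decision_values, unlabeled, labeled, alpha=0.5):
    """CMB-style score: convex combination of min-max-normalized DIST
    and DIV scores. alpha is a hypothetical mixing weight."""
    def norm(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    return (alpha * norm(dist_scores(decision_values))
            + (1 - alpha) * norm(div_scores(unlabeled, labeled)))

def select_batch(scores, k):
    """Pick the k highest-scoring unlabeled examples to send for labeling."""
    return np.argsort(scores)[::-1][:k]
```

In each active learning round, the selected batch is labeled, added to the training set, the classifier is retrained, and the scores are recomputed; passive learning instead draws the batch uniformly at random.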
Classification accuracy and the area under the receiver operating characteristic (ROC) curve were computed for each algorithm at a range of training-set sizes. The performance of the active learning algorithms was compared with that of passive learning using a weighted mean of paired differences. To determine why performance varies across datasets, we measured the diversity and uncertainty of each dataset using relative entropy and correlated these measures with the performance differences.
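The relative entropy (Kullback-Leibler divergence) used to characterize the datasets can be sketched as follows. This is only an illustrative computation under stated assumptions: the tokenizer is plain whitespace splitting and the epsilon smoothing is a common shortcut; the paper's exact preprocessing and smoothing scheme are not specified here.

```python
import math
from collections import Counter

def relative_entropy(p, q, eps=1e-12):
    """KL divergence D(p || q) = sum_i p_i * log(p_i / q_i).
    eps avoids log(0) for zero-probability terms (a simple smoothing
    shortcut, assumed here rather than taken from the paper)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def term_distribution(docs, vocab):
    """Unigram term distribution over a fixed vocabulary.
    Assumption: whitespace tokenization."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    total = sum(counts[w] for w in vocab) or 1
    return [counts[w] / total for w in vocab]
```

A dataset-level diversity score could then be obtained, for example, by averaging the relative entropy between each document's term distribution and the corpus-wide distribution; higher average divergence indicates a more heterogeneous collection.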
The DIST and CMB algorithms outperformed passive learning. At a statistical significance level of 0.05, DIST outperformed passive learning on all five datasets, and CMB on four. We found strong correlations between dataset diversity and DIV performance, and between dataset uncertainty and DIST performance.
For medical text classification, appropriate active learning algorithms can yield performance comparable to that of passive learning with considerably smaller training sets. In particular, our results suggest that DIV performs better on data with higher diversity and DIST on data with lower uncertainty.