Cost-aware active learning for named entity recognition in clinical text.

Affiliations

School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA.

Pieces Technologies Inc, Dallas, Texas, USA.

Publication information

J Am Med Inform Assoc. 2019 Nov 1;26(11):1314-1322. doi: 10.1093/jamia/ocz102.

Abstract

OBJECTIVE

Active learning (AL) attempts to reduce annotation cost (ie, time) by selecting the most informative examples for annotation. Most approaches tacitly (and unrealistically) assume that the cost of annotating each sample is identical. This study introduces a cost-aware AL method that simultaneously models the annotation cost and the informativeness of the samples, and evaluates it in both simulation and user studies.

MATERIALS AND METHODS

We designed a novel, cost-aware AL algorithm (Cost-CAUSE) for annotating clinical named entities. We first used lexical and syntactic features to estimate annotation cost, then incorporated this cost measure into an existing AL algorithm. Using the 2010 i2b2/VA data set, we conducted a simulation study comparing Cost-CAUSE with non-cost-aware AL methods, and a user study comparing Cost-CAUSE with passive learning.
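The general idea of weighing informativeness against estimated annotation cost can be illustrated with a minimal sketch. This is not the Cost-CAUSE algorithm itself, whose details the abstract does not give. The `Candidate` fields, the linear cost model with placeholder weights, and the ratio-based ranking are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One unlabeled sentence with hypothetical precomputed statistics."""
    sentence_id: str
    num_tokens: int          # assumed lexical feature
    num_noun_phrases: int    # assumed syntactic feature
    uncertainty: float       # assumed informativeness score in [0, 1]


def estimate_cost(c: Candidate, w_tokens: float = 0.5,
                  w_phrases: float = 2.0, bias: float = 1.0) -> float:
    """Predict annotation time (seconds) with a linear model.

    The weights are placeholders; in practice they would be fit to
    recorded annotation times.
    """
    return bias + w_tokens * c.num_tokens + w_phrases * c.num_noun_phrases


def select_batch(pool: List[Candidate], batch_size: int) -> List[Candidate]:
    """Rank candidates by informativeness per unit of estimated cost."""
    ranked = sorted(pool,
                    key=lambda c: c.uncertainty / estimate_cost(c),
                    reverse=True)
    return ranked[:batch_size]


# Made-up pool for illustration only.
pool = [
    Candidate("s1", num_tokens=40, num_noun_phrases=8, uncertainty=0.90),
    Candidate("s2", num_tokens=10, num_noun_phrases=2, uncertainty=0.60),
    Candidate("s3", num_tokens=25, num_noun_phrases=5, uncertainty=0.30),
]
chosen = select_batch(pool, batch_size=2)
print([c.sentence_id for c in chosen])  # ['s2', 's1']
```

In this toy pool the short sentence `s2` is ranked ahead of the more uncertain but much longer `s1`, which is the kind of trade-off a cost-aware selector is meant to capture.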

RESULTS

Our cost model fit empirical annotation data well, and Cost-CAUSE increased area under the learning curve (ALC) scores in simulation by up to 5.6% and 4.9% compared with random sampling and alternative AL methods, respectively. Moreover, in a user annotation task, Cost-CAUSE outperformed passive learning on the ALC score and reduced annotation time by 20.5%-30.2%.
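For readers unfamiliar with the ALC metric, the sketch below shows one common way to compute an area under a learning curve: trapezoidal integration of model score against cumulative annotation cost, normalized by the cost range. The abstract does not state the exact definition or normalization used in the study, so this formulation is an assumption, and the numbers are invented for illustration.

```python
from typing import Sequence


def area_under_learning_curve(cost: Sequence[float],
                              score: Sequence[float]) -> float:
    """Trapezoidal area under a (cost, score) curve, divided by the cost range."""
    if len(cost) != len(score) or len(cost) < 2:
        raise ValueError("need at least two matching (cost, score) points")
    area = 0.0
    for i in range(1, len(cost)):
        width = cost[i] - cost[i - 1]
        if width <= 0:
            raise ValueError("cost values must be strictly increasing")
        # Area of the trapezoid between two consecutive measurements.
        area += width * (score[i] + score[i - 1]) / 2.0
    return area / (cost[-1] - cost[0])


# Made-up illustrative numbers, not results from the study.
minutes = [0, 30, 60, 90, 120]
f1_random = [0.0, 0.5, 0.6, 0.7, 0.7]
f1_active = [0.0, 0.6, 0.7, 0.7, 0.8]

alc_random = area_under_learning_curve(minutes, f1_random)
alc_active = area_under_learning_curve(minutes, f1_active)
print(f"random: {alc_random:.4f}, active: {alc_active:.4f}")
```

A higher ALC means the model reaches good performance with less annotation effort, which is why the metric suits comparisons between sampling strategies.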

DISCUSSION

Although AL has proven effective in simulations, our user study shows that a real-world environment is far more complex. Other factors, such as users' annotation accuracy, fatigue, and even physical and mental condition, have a noticeable effect on the AL method.

CONCLUSION

Cost-CAUSE saves significant annotation cost compared to random sampling.
