School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA.
Pieces Technologies Inc, Dallas, Texas, USA.
J Am Med Inform Assoc. 2019 Nov 1;26(11):1314-1322. doi: 10.1093/jamia/ocz102.
Active learning (AL) attempts to reduce annotation cost (ie, time) by selecting the most informative examples for annotation. Most approaches tacitly (and unrealistically) assume that the cost of annotating each sample is identical. This study introduces a cost-aware AL method that simultaneously models both the annotation cost and the informativeness of the samples, and evaluates it via simulation and user studies.
We designed a novel, cost-aware AL algorithm (Cost-CAUSE) for annotating clinical named entities. We first used lexical and syntactic features to estimate annotation cost, then incorporated this cost measure into an existing AL algorithm. Using the 2010 i2b2/VA data set, we conducted a simulation study comparing Cost-CAUSE with non-cost-aware AL methods, and a user study comparing Cost-CAUSE with passive learning.
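The general idea of combining a cost estimate with an informativeness score can be illustrated with a minimal sketch. This is not the Cost-CAUSE algorithm itself, whose details are not given in this abstract. It assumes a simple benefit-per-cost ranking, a least-confidence uncertainty score, and a linear cost model over two stand-in features (token count and capitalized-token count). All function names, features, and weights below are hypothetical.

```python
from typing import Dict, List, Tuple


def least_confidence(label_probs: List[float]) -> float:
    """Informativeness proxy: 1 minus the probability of the most likely label."""
    return 1.0 - max(label_probs)


def estimate_cost(sentence: str, weights: Dict[str, float]) -> float:
    """Hypothetical linear cost model (e.g., seconds to annotate a sentence).

    The two features are illustrative stand-ins for the lexical and
    syntactic features mentioned in the text, not the authors' feature set.
    """
    tokens = sentence.split()
    n_tokens = len(tokens)
    n_capitalized = sum(1 for t in tokens if t[:1].isupper())
    cost = (weights["bias"]
            + weights["per_token"] * n_tokens
            + weights["per_capitalized"] * n_capitalized)
    # Guard against non-positive estimates so the ratio below stays defined.
    return max(cost, 1e-6)


def rank_by_benefit_per_cost(
    pool: List[Tuple[str, List[float]]],
    weights: Dict[str, float],
) -> List[str]:
    """Order unlabeled sentences by informativeness divided by estimated cost."""
    scored = []
    for sentence, label_probs in pool:
        ratio = least_confidence(label_probs) / estimate_cost(sentence, weights)
        scored.append((ratio, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]
```

Under this sketch, two equally uncertain sentences are ranked so that the one with the lower estimated annotation cost is queried first, which is the behavior a cost-aware selection strategy is meant to capture.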
Our cost model fit empirical annotation data well, and Cost-CAUSE increased the area under the learning curve (ALC) scores in simulation by up to 5.6% and 4.9% compared with random sampling and alternate AL methods, respectively. Moreover, in a user annotation task, Cost-CAUSE outperformed passive learning on the ALC score and reduced annotation time by 20.5%-30.2%.
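For readers unfamiliar with the ALC score, the following sketch shows one common way to compute an area under a learning curve: trapezoidal integration of model performance over cumulative annotation cost, normalized by the cost span. The exact ALC definition used in the study is not stated in this abstract, so the normalization and the function name here are assumptions.

```python
from typing import List


def area_under_learning_curve(
    cumulative_cost: List[float],
    scores: List[float],
    normalize: bool = True,
) -> float:
    """Trapezoidal area under a (cost, score) learning curve.

    cumulative_cost: non-decreasing annotation cost at each evaluation point
                     (e.g., minutes spent annotating so far).
    scores:          model performance (e.g., F-measure) at those points.
    normalize:       if True, divide by the cost span so the result is the
                     average score over the curve (an assumed convention).
    """
    if len(cumulative_cost) != len(scores) or len(scores) < 2:
        raise ValueError("need at least two matching (cost, score) points")

    area = 0.0
    for i in range(1, len(scores)):
        width = cumulative_cost[i] - cumulative_cost[i - 1]
        if width < 0:
            raise ValueError("cumulative_cost must be non-decreasing")
        area += width * (scores[i] + scores[i - 1]) / 2.0

    if normalize:
        span = cumulative_cost[-1] - cumulative_cost[0]
        if span <= 0:
            raise ValueError("cost span must be positive to normalize")
        area /= span
    return area
```

A method that reaches a given score at lower annotation cost yields a curve that rises earlier and therefore a larger area, which is why ALC is a convenient single-number summary when comparing selection strategies.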
Although AL has proven effective in simulations, our user study shows that a real-world environment is far more complex. Other factors noticeably affect AL performance, such as users' annotation accuracy, their fatigue, and even their physical and mental condition.
Cost-CAUSE saves significant annotation cost compared to random sampling.