Wei Qiang, Franklin Amy, Cohen Trevor, Xu Hua
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
AMIA Annu Symp Proc. 2018 Dec 5;2018:1552-1560. eCollection 2018.
Building high-quality annotated clinical corpora is necessary for developing statistical Natural Language Processing (NLP) models to unlock information embedded in clinical text, but it is also time-consuming and expensive. Consequently, it is important to identify factors that may affect annotation time, such as the syntactic complexity of the text to be annotated and the vagaries of individual user behavior. However, limited work has been done to understand the annotation of clinical text. In this study, we aimed to investigate how factors inherent to the text affect annotation time for a named entity recognition (NER) task. We recruited 9 users to annotate a clinical corpus and recorded the annotation time for each sample. We then defined a set of factors that we hypothesized might affect annotation time and used them as predictors in a linear regression model of annotation time. The linear regression model achieved an R² of 0.611 and revealed eight time-associated factors, including characteristics of sentences, individual users, and annotation order, with implications for the practice of annotation and the development of cost models for active learning research.
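The regression setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three predictors (`n_words`, `n_entities`, `order`), the coefficients used to simulate annotation times, and the helper names `fit_ols` and `r_squared` are all hypothetical, and the data are synthetic. It only shows how candidate factors can be regressed on annotation time and how the fit can be summarized with R².

```python
import random


def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Each row of X must already contain a leading 1.0 for the intercept.
    """
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
        + [sum(X[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def r_squared(X, y, beta):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    pred = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic data only (assumption): simulated annotation time in seconds
# as a function of sentence length, entity count, and annotation order.
rng = random.Random(0)
X, y = [], []
for order in range(500):
    n_words = rng.randint(5, 40)
    n_entities = rng.randint(0, 5)
    seconds = 2.0 + 0.3 * n_words + 1.5 * n_entities - 0.01 * order + rng.gauss(0, 1.0)
    X.append([1.0, n_words, n_entities, order])
    y.append(seconds)

beta = fit_ols(X, y)
r2 = r_squared(X, y, beta)
print("coefficients (intercept, n_words, n_entities, order):", beta)
print("R^2:", r2)
```

In a sketch like this, the sign and size of each coefficient indicate how the corresponding factor is associated with annotation time. Categorical factors such as the individual user would need to be encoded (for example, as indicator variables) before being added as columns of `X`.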