Zhai Haijun, Lingren Todd, Deleger Louise, Li Qi, Kaiser Megan, Stoutenborough Laura, Solti Imre
Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA.
J Med Internet Res. 2013 Apr 2;15(4):e73. doi: 10.2196/jmir.2426.
BACKGROUND: A high-quality gold standard is vital for supervised, machine learning-based, clinical natural language processing (NLP) systems. In clinical NLP projects, expert annotators traditionally create the gold standard. However, traditional annotation is expensive and time-consuming. To reduce the cost of annotation, general NLP projects have turned to crowdsourcing based on Web 2.0 technology, which involves submitting smaller subtasks to a coordinated marketplace of workers on the Internet. Many studies have been conducted in the area of crowdsourcing, but only a few have focused on tasks in the general NLP field and only a handful in the biomedical domain, usually based on very small pilot sample sizes. In addition, the quality of crowdsourced biomedical NLP corpora was never exceptional when compared to traditionally-developed gold standards. Previously reported results on a medical named entity annotation task showed an F-measure-based agreement of 0.68 between crowdsourced and traditionally-developed corpora.
OBJECTIVE: Building upon previous work in general crowdsourcing research, this study investigated the usability of crowdsourcing in the clinical NLP domain, with special emphasis on achieving high agreement between crowdsourced and traditionally-developed corpora.
METHODS: To build the gold standard for evaluating the crowdsourcing workers' performance, 1042 clinical trial announcements (CTAs) from the ClinicalTrials.gov website were randomly selected and double annotated for medication names, medication types, and linked attributes. For the experiments, we used CrowdFlower, an Amazon Mechanical Turk-based crowdsourcing platform. We calculated sensitivity, precision, and F-measure to evaluate the quality of the crowd's work and tested for statistically significant differences (P<.001, chi-square test) between the crowdsourced and traditionally-developed annotations.
RESULTS: The agreement between the crowd's annotations and the traditionally-generated corpora was high for (1) annotation (F-measure 0.87 for medication names; 0.73 for medication types) and (2) correction of previous annotations (0.90 for medication names; 0.76 for medication types), and excellent for (3) linking medications with their attributes (0.96). Simple voting provided the best judgment aggregation approach. There was no statistically significant difference between the crowdsourced and traditionally-generated corpora. Our results showed a 27.9% improvement over previously reported results on the medication named entity annotation task.
CONCLUSIONS: This study offers three contributions. First, we demonstrated that crowdsourcing is a feasible, inexpensive, fast, and practical approach to collecting high-quality annotations for clinical text (when protected health information is excluded). We believe that well-designed user interfaces and a rigorous quality-control strategy for entity annotation and linking were critical to the success of this work. Second, as a further contribution to the Internet-based crowdsourcing field, we will publicly release the JavaScript and CrowdFlower Markup Language infrastructure code that is necessary to utilize CrowdFlower's quality control and crowdsourcing interfaces for named entity annotations. Finally, to spur future research, we will release the CTA annotations that were generated by the traditional and crowdsourced approaches.
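The Methods compare crowd-generated entity annotations against the double-annotated gold standard using sensitivity (recall), precision, and F-measure. Below is a minimal sketch of that comparison, assuming span-level exact matching and an illustrative (doc_id, start, end, label) representation; it is not taken from the paper's released code.

# Minimal sketch: span-level comparison of crowdsourced entity annotations
# against a traditionally developed gold standard. The tuple layout and
# exact-match criterion are illustrative assumptions, not the paper's code.

def evaluate(gold, predicted):
    """Compute sensitivity (recall), precision, and F-measure.

    gold, predicted: sets of (doc_id, start, end, label) tuples,
    e.g. ("NCT00000102", 120, 131, "medication_name").
    """
    true_positives = len(gold & predicted)
    false_negatives = len(gold - predicted)
    false_positives = len(predicted - gold)

    sensitivity = true_positives / (true_positives + false_negatives) if gold else 0.0
    precision = true_positives / (true_positives + false_positives) if predicted else 0.0
    f_measure = (
        2 * precision * sensitivity / (precision + sensitivity)
        if (precision + sensitivity) > 0
        else 0.0
    )
    return sensitivity, precision, f_measure

if __name__ == "__main__":
    gold = {("cta1", 10, 19, "medication_name"), ("cta1", 45, 52, "medication_type")}
    crowd = {("cta1", 10, 19, "medication_name"), ("cta1", 60, 66, "medication_name")}
    print(evaluate(gold, crowd))  # (0.5, 0.5, 0.5)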
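The Results report that simple voting was the best judgment aggregation approach. A minimal sketch of majority voting over per-unit worker judgments follows; the data layout is an assumption for illustration, not CrowdFlower's actual judgment format.

# Minimal sketch of "simple voting" aggregation: each candidate annotation is
# judged by several workers, and the label chosen by the majority is kept.
from collections import Counter

def majority_vote(judgments):
    """judgments: list of labels submitted by different workers for one unit,
    e.g. ["medication_name", "medication_name", "not_a_medication"].
    Returns the most frequent label (ties broken arbitrarily by Counter)."""
    return Counter(judgments).most_common(1)[0][0]

unit_judgments = {
    "cta1:span_10_19": ["medication_name", "medication_name", "medication_type"],
    "cta1:span_45_52": ["medication_type", "medication_type", "medication_type"],
}
aggregated = {unit: majority_vote(votes) for unit, votes in unit_judgments.items()}
print(aggregated)  # both units resolve to the majority label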