Canales Lea, Menke Sebastian, Marchesseau Stephanie, D'Agostino Ariel, Del Rio-Bermudez Carlos, Taberna Miren, Tello Jorge
Department of Software and Computing Systems, University of Alicante, Alicante, Spain.
MedSavana SL, Madrid, Spain.
JMIR Med Inform. 2021 Jul 23;9(7):e20492. doi: 10.2196/20492.
Clinical natural language processing (cNLP) systems are of crucial importance due to their increasing capability in extracting clinically important information from the free text contained in electronic health records (EHRs). The conversion of an unstructured representation of a patient's clinical history into a structured format enables medical doctors to generate clinical knowledge at a level that was not possible before. Finally, the interpretation of the insights provided by cNLP systems has great potential for driving decisions about clinical practice. However, carrying out robust evaluations of those cNLP systems is a complex task that is hindered by a lack of standard guidance on how to systematically approach them.
Our objective was to offer natural language processing (NLP) experts a methodology for the evaluation of cNLP systems to assist them in carrying out this task. By following the proposed phases, NLP experts can ensure that the performance metrics of their cNLP systems are robust and representative.
The proposed evaluation methodology comprises five phases: (1) the definition of the target population, (2) the statistically driven collection of documents, (3) the design of the annotation guidelines and annotation project, (4) the external annotations, and (5) the cNLP system performance evaluation. We present the application of all five phases to evaluate the performance of a cNLP system called "EHRead Technology" (developed by Savana, an international medical company) in a study of patients with asthma. As part of the evaluation methodology, we introduce the Sample Size Calculator for Evaluations (SLiCE), a software tool that calculates the number of documents needed to achieve a statistically useful and resource-efficient gold standard.
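The sample size reasoning behind a tool like SLiCE can be illustrated with the standard formula for estimating a proportion within a given confidence interval half-width, optionally corrected for a finite document population. This is a minimal sketch of that general statistical approach, not SLiCE's actual API; the function name and parameters are assumptions for illustration.

```python
import math

def gold_standard_size(expected_prevalence, margin, population=None, z=1.96):
    """Illustrative sample-size sketch (not SLiCE's real interface).

    Returns the number of documents needed so that the 95% CI half-width
    around an observed proportion (e.g., variable prevalence) is <= margin.
    An optional finite population correction shrinks n when the total
    document pool is small.
    """
    # Cochran's formula for a proportion
    n = z**2 * expected_prevalence * (1 - expected_prevalence) / margin**2
    if population is not None:
        # finite population correction
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

# Worst-case prevalence (0.5) with a +/-5% margin
print(gold_standard_size(0.5, 0.05))  # 385 documents
```

The worst-case prevalence of 0.5 maximizes the variance term, so 385 documents is an upper bound for a +/-5% margin; a known prevalence or a finite population both reduce the required annotation effort.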
The application of the proposed evaluation methodology to a real use-case study of patients with asthma revealed the benefit of the different phases for cNLP system evaluations. By using SLiCE to adjust the number of documents needed, a meaningful and resource-efficient gold standard was created. In the presented use case, using as few as 519 EHRs, it was possible to evaluate the performance of the cNLP system and obtain performance metrics for the primary variable within the expected CIs.
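To see why roughly 500 EHRs can suffice, consider the CI one obtains on a performance metric at that sample size. The sketch below computes a Wilson score interval for a proportion such as precision; the count of 480 correct extractions out of 519 is a hypothetical illustration, not a result from the study.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion, e.g. precision
    measured on n gold-standard annotations. More reliable than the
    normal approximation for proportions near 0 or 1."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical example: 480 correct out of 519 evaluated documents
lo, hi = wilson_ci(480, 519)
print(f"precision CI: ({lo:.3f}, {hi:.3f})")
```

At n = 519 the interval spans only a few percentage points, which is why a statistically sized gold standard of that order can yield metrics within the expected CIs without annotating thousands of documents.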
We showed that our evaluation methodology can offer NLP experts guidance on how to approach the evaluation of their cNLP systems. By following the five phases, NLP experts can ensure the robustness of their evaluation and avoid unnecessary investment of human and financial resources. Besides the theoretical guidance, we offer SLiCE as an easy-to-use, open-source Python library.