Divita G, Carter M, Redd A, Zeng Q, Gupta K, Trautner B, Samore M, Gundlapalli A
Guy Divita, University of Utah School of Medicine, Division of Epidemiology, 295 Chipeta Way, Salt Lake City, UT 84132, USA, E-mail:
Methods Inf Med. 2015;54(6):548-52. doi: 10.3414/ME14-02-0018. Epub 2015 Nov 4.
This article is part of the Focus Theme of Methods of Information in Medicine on "Big Data and Analytics in Healthcare".
This paper describes the scale-up efforts at the VA Salt Lake City Health Care System to address processing large corpora of clinical notes through a natural language processing (NLP) pipeline. The use case described is a current project focused on detecting the presence of an indwelling urinary catheter in hospitalized patients and subsequent catheter-associated urinary tract infections.
An NLP algorithm using v3NLP was developed to detect the presence of an indwelling urinary catheter in hospitalized patients. The algorithm was tested on a small corpus of notes on patients for whom the presence or absence of a catheter was already known (reference standard). In planning for a scale-up, we estimated that the original algorithm would have taken 2.4 days to run on the larger corpus of notes for this project (550,000 notes), and 27 days on a corpus of 6 million records representative of a national sample of notes. We approached scaling up NLP pipelines through three techniques: pipeline replication via multi-threading, intra-annotator threading for tasks that can be further decomposed, and remote annotator services that enable annotator scale-out.
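The first of these techniques, pipeline replication via multi-threading, can be illustrated with a minimal Python sketch. It is not the v3NLP or UIMA API: `ToyPipeline`, its keyword list, `annotate`, and `run_replicated` are hypothetical names introduced only for illustration, and the keyword match stands in for a real annotator chain. The idea shown is that each worker thread holds its own pipeline instance, so notes can be processed concurrently without sharing annotator state.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyPipeline:
    """Hypothetical stand-in for an NLP pipeline (not the v3NLP API)."""

    # Assumed keyword list, for illustration only.
    TERMS = ("foley", "indwelling catheter", "urinary catheter")

    def process(self, note: str) -> bool:
        # Naive keyword match standing in for the real annotators.
        text = note.lower()
        return any(term in text for term in self.TERMS)


# Thread-local storage: each worker thread gets its own pipeline replica.
_local = threading.local()


def _pipeline() -> ToyPipeline:
    # Build the replica lazily, the first time a given thread needs it.
    if not hasattr(_local, "pipeline"):
        _local.pipeline = ToyPipeline()
    return _local.pipeline


def annotate(note: str) -> bool:
    return _pipeline().process(note)


def run_replicated(notes, workers=4):
    # pool.map preserves input order, so results line up with the notes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, notes))


if __name__ == "__main__":
    sample = [
        "Foley catheter placed in the ED.",
        "Patient ambulating without assistance.",
        "Indwelling catheter in place on admission.",
    ]
    print(run_replicated(sample, workers=2))
```

In CPython, threads do not speed up CPU-bound work because of the global interpreter lock, so this sketch shows the structure of replication rather than its performance; a process pool would be the closer analogue for compute-heavy annotators.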
The scale-up reduced the average time to process a record from 206 milliseconds to 17 milliseconds, a 12-fold increase in performance, when applied to a corpus of 550,000 notes.
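The 12-fold figure follows directly from the two per-record averages. The short sketch below reproduces that arithmetic and also shows what the 17-millisecond average would imply for the 550,000-note corpus if it held uniformly across all notes, which is an assumption made here for illustration; the function names are hypothetical.

```python
def fold_speedup(before_ms: float, after_ms: float) -> float:
    # Ratio of per-record processing times before and after the scale-up.
    return before_ms / after_ms


def corpus_hours(n_records: int, ms_per_record: float) -> float:
    # Total wall-clock hours, assuming a constant per-record average.
    return n_records * ms_per_record / 1000 / 3600


if __name__ == "__main__":
    print(round(fold_speedup(206, 17), 1))       # 12.1
    print(round(corpus_hours(550_000, 17), 1))   # 2.6
```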
Purposely simplistic in nature, these scale-up efforts are the straightforward evolution from small-scale NLP processing to larger-scale extraction, without incurring the associated complexities inherited from the use of the underlying UIMA framework. These efforts represent generalizable and widely applicable techniques that will aid other computationally complex NLP pipelines that need to be scaled out for processing and analyzing big data.