Hardjojo Antony, Gunachandran Arunan, Pang Long, Abdullah Mohammed Ridzwan Bin, Wah Win, Chong Joash Wen Chen, Goh Ee Hui, Teo Sok Huang, Lim Gilbert, Lee Mong Li, Hsu Wynne, Lee Vernon, Chen Mark I-Cheng, Wong Franco, Phang Jonathan Siung King
Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore.
National Healthcare Group Polyclinics, Singapore, Singapore.
JMIR Med Inform. 2018 Jun 11;6(2):e36. doi: 10.2196/medinform.8204.
Free-text clinical records provide a source of information that complements traditional disease surveillance. To harness these records electronically, natural language processing algorithms must first transform them into codified fields.
The aim of this study was to develop, train, and validate Clinical History Extractor for Syndromic Surveillance (CHESS), a natural language processing algorithm to extract clinical information from free-text primary care records.
CHESS is a keyword-based natural language processing algorithm that extracts 48 signs and symptoms, covering those suggestive of respiratory infections and gastrointestinal infections, constitutional symptoms, and other signs and symptoms potentially associated with infectious diseases. The algorithm also captures the assertion status (affirmed, negated, or suspected) and symptom duration. Electronic medical records from the National Healthcare Group Polyclinics, a major public sector primary care provider in Singapore, were randomly extracted and manually reviewed by 2 human reviewers, with a third reviewer as the adjudicator. The algorithm was evaluated on 1680 notes against the human-coded result as the reference standard, with half of the data used for training and the other half for validation.
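As a rough illustration of what a keyword-based extractor with assertion status and duration might look like, the sketch below scans each clause of a note for symptom keywords, then checks the same clause for negation or suspicion cues and a simple duration pattern. The lexicon, the cue lists, the duration regex, and the function name `extract_symptoms` are all hypothetical; the abstract does not describe the actual rules or the 48-item lexicon used by CHESS.

```python
import re

# Hypothetical mini-lexicon; the real CHESS lexicon covers 48 signs and symptoms.
SYMPTOM_KEYWORDS = {
    "cough": ["cough"],
    "fever": ["fever"],
    "sore throat": ["sore throat"],
    "rhinorrhea": ["runny nose", "rhinorrhea"],
}
# Hypothetical cue words, matched as whole tokens within the same clause.
NEGATION_CUES = {"no", "denies", "without"}
SUSPICION_CUES = {"possible", "suspected", "likely"}
# Assumed duration pattern such as "3 days" or "2 weeks".
DURATION_RE = re.compile(r"(\d+)\s*(day|week|month)s?")


def extract_symptoms(note):
    """Return a list of (symptom, assertion_status, duration) tuples."""
    results = []
    # Treat each clause separately so a cue only affects nearby symptoms.
    for clause in re.split(r"[.,;\n]", note.lower()):
        tokens = set(clause.split())
        for symptom, keywords in SYMPTOM_KEYWORDS.items():
            found = any(
                re.search(r"\b" + re.escape(k) + r"\b", clause) for k in keywords
            )
            if not found:
                continue
            if tokens & NEGATION_CUES:
                status = "negated"
            elif tokens & SUSPICION_CUES:
                status = "suspected"
            else:
                status = "affirmed"
            match = DURATION_RE.search(clause)
            duration = (int(match.group(1)), match.group(2)) if match else None
            results.append((symptom, status, duration))
    return results


if __name__ == "__main__":
    for row in extract_symptoms("Cough for 3 days. No fever. Possible sore throat"):
        print(row)
```

Scoping cues to a single clause is one simple design choice; a production system would likely need richer context handling for abbreviations, misspellings, and cues that span clauses.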
The symptoms most commonly present within the 1680 clinical records at the episode level were those typically present in respiratory infections such as cough (744/7703, 9.66%), sore throat (591/7703, 7.67%), rhinorrhea (552/7703, 7.17%), and fever (928/7703, 12.04%). At the episode level, CHESS had an overall performance of 96.7% precision and 97.6% recall on the training dataset and 96.0% precision and 93.1% recall on the validation dataset. Symptoms suggesting respiratory and gastrointestinal infections were all detected with more than 90% precision and recall. CHESS correctly assigned the assertion status in 97.3%, 97.9%, and 89.8% of affirmed, negated, and suspected signs and symptoms, respectively (97.6% overall accuracy). Symptom episode duration was correctly identified in 81.2% of records with known duration status.
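For readers less familiar with the metrics reported above, precision is the share of extracted symptom episodes that the reference standard confirms, and recall is the share of reference episodes that the algorithm finds. The following is a minimal sketch of that computation, assuming each episode can be represented as a hypothetical (note ID, symptom) pair; the data shown are invented for illustration and are not from the study.

```python
def precision_recall(predicted, reference):
    """Compute precision and recall for two collections of (note_id, symptom) pairs."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented toy data: algorithm output versus human-coded reference.
    algorithm_output = {("n1", "cough"), ("n1", "fever"), ("n2", "cough")}
    human_coded = {("n1", "cough"), ("n2", "cough"), ("n2", "fever"), ("n2", "sore throat")}
    p, r = precision_recall(algorithm_output, human_coded)
    print(f"precision={p:.3f}, recall={r:.3f}")
```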
We have developed a natural language processing algorithm, dubbed CHESS, that achieves good performance in extracting signs and symptoms from primary care free-text clinical records. In addition to detecting the presence of symptoms, the algorithm can accurately distinguish affirmed, negated, and suspected assertion statuses and extract symptom durations.