Ramanan S V, Radhakrishna Kedar, Waghmare Abijeet, Raj Tony, Nathan Senthil P, Sreerama Sai Madhukar, Sampath Sriram
RelAgent Technologies (P) Limited, IIT Madras Research Park, #14, 1st Floor, Taramani, Chennai, 600113, India.
Division of Medical Informatics, St. John's Research Institute, 100 Feet Road, Koramangala, Bangalore, 560034, India.
J Med Syst. 2016 Aug;40(8):187. doi: 10.1007/s10916-016-0541-2. Epub 2016 Jun 24.
Electronic Health Record (EHR) use in India is generally poor, and structured clinical information is mostly lacking. This work is the first attempt to evaluate unstructured text mining for extracting relevant clinical information from Indian clinical records. We annotated a corpus of 250 discharge summaries from an Intensive Care Unit (ICU) in India with markups for diseases, procedures, and lab parameters and their attributes, as well as key demographic information and administrative variables such as patient outcomes. In the process, we constructed guidelines for an annotation scheme useful to clinicians in the Indian context, and we evaluated the performance of an NLP engine, Cocoa, on a cohort of these records. The annotated corpus comprises roughly 90 thousand words and is, to our knowledge, the first tagged clinical corpus from India. Cocoa was evaluated on a test corpus of 50 documents. The overlap F-scores for the major categories, namely disease/symptoms, procedures, laboratory parameters, and outcomes, are 0.856, 0.834, 0.961, and 0.872, respectively. These results are competitive with those from recent shared tasks based on US records. The annotated corpus and the associated results from the Cocoa engine indicate that unstructured text mining is a viable method for cohort analysis in the Indian clinical context, where structured EHRs are largely absent.
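For readers unfamiliar with overlap-based scoring, the following is a minimal sketch of how an overlap F-score can be computed for one document. The abstract does not state the exact matching criterion used for Cocoa, so the rule here (a predicted span counts as correct if it shares at least one character with a gold span of the same category) is an assumption, and the names `Span`, `overlaps`, and `overlap_f_score` are hypothetical.

```python
from typing import List, Tuple

# A span is (start offset, end offset exclusive, category label).
# This representation is an assumption made for illustration only.
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def overlap_f_score(gold: List[Span], predicted: List[Span]) -> float:
    """F-score under relaxed (overlap) matching.

    Precision: fraction of predicted spans overlapping some gold span.
    Recall: fraction of gold spans overlapped by some predicted span.
    """
    if not gold or not predicted:
        return 0.0
    matched_pred = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(overlaps(g, p) for p in predicted))
    precision = matched_pred / len(predicted)
    recall = matched_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy annotations, not taken from the corpus described above.
    gold = [(0, 9, "disease"), (20, 30, "procedure"), (40, 50, "lab")]
    predicted = [
        (2, 9, "disease"),      # partial overlap with the gold disease span
        (20, 28, "procedure"),  # partial overlap with the gold procedure span
        (60, 70, "lab"),        # no overlap with any gold span
        (41, 45, "disease"),    # overlaps the gold lab span, but wrong label
    ]
    print(round(overlap_f_score(gold, predicted), 3))
```

Under this relaxed criterion, boundary disagreements are not penalized, which is why overlap scores are typically higher than exact-match scores on the same output.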