Kavuluru Ramakanth, Han Sifei, Harris Daniel
Division of Biomedical Informatics, Department of Biostatistics, University of Kentucky, Lexington, KY.
Department of Computer Science, University of Kentucky, Lexington, KY.
Adv Artif Intell. 2013 May;7884:77-88. doi: 10.1007/978-3-642-38457-8_7.
Diagnosis codes are extracted from medical records for billing and reimbursement and for secondary uses such as quality control and cohort identification. In the US, these codes come from the standard terminology ICD-9-CM, which is derived from the International Classification of Diseases (ICD). Trained human coders generally extract ICD-9 codes by reading all artifacts available in a patient's medical record and following specific coding guidelines. To assist coders in this manual process, this paper proposes an unsupervised ensemble approach to automatically extract ICD-9 diagnosis codes from textual narratives included in electronic medical records (EMRs). Earlier attempts at automatic extraction focused on individual documents such as radiology reports and discharge summaries. Here we use a more realistic dataset and extract ICD-9 codes from EMRs of 1000 inpatient visits at the University of Kentucky Medical Center. Using named entity recognition (NER), graph-based concept-mapping of medical concepts, and extractive text summarization techniques, we achieve an example-based average recall of 0.42 with an average precision of 0.47; compared with a baseline of using only NER, we observe a 12% improvement in recall with the graph-based approach and a 7% improvement in precision using the extractive text summarization approach. Although diagnosis codes are complex concepts often expressed in text with significant long-range non-local dependencies, our present work shows the potential of unsupervised methods in extracting a portion of the codes. As such, our findings are especially relevant for code extraction tasks where obtaining large amounts of training data is difficult.
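The example-based averages reported in the abstract can be illustrated with a short sketch. It assumes the common definition of these metrics: precision and recall are computed per visit from the overlap between predicted and assigned code sets, then averaged over visits. The abstract does not spell out the exact formula, so the function name `example_based_scores`, the handling of empty sets, and the toy ICD-9 codes below are all illustrative assumptions rather than the authors' implementation or data.

```python
from typing import List, Set, Tuple


def example_based_scores(
    predicted: List[Set[str]], actual: List[Set[str]]
) -> Tuple[float, float]:
    """Return (average precision, average recall) over examples.

    Each list element is the set of codes for one example (here, one visit).
    Assumption: an empty predicted set scores precision 0.0, and an empty
    gold set scores recall 0.0.
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")

    precisions, recalls = [], []
    for pred, gold in zip(predicted, actual):
        overlap = len(pred & gold)  # codes that were both predicted and assigned
        precisions.append(overlap / len(pred) if pred else 0.0)
        recalls.append(overlap / len(gold) if gold else 0.0)

    n = len(predicted)
    return sum(precisions) / n, sum(recalls) / n


# Toy data (hypothetical, not from the paper): two visits.
predicted = [{"428.0", "401.9"}, {"250.00"}]
actual = [{"428.0", "584.9"}, {"250.00", "401.9", "272.4"}]

avg_precision, avg_recall = example_based_scores(predicted, actual)
print(f"precision={avg_precision:.2f} recall={avg_recall:.2f}")
```

Because the averaging is done per visit, a visit with many assigned codes counts the same as a visit with only one, which differs from micro-averaging over all code assignments pooled together.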