IEEE J Biomed Health Inform. 2018 Jul;22(4):1323-1329. doi: 10.1109/JBHI.2017.2743824. Epub 2017 Aug 24.
This work applies data mining to clinical documentation. Diagnostic terms (DTs) serve as keywords for retrieving valuable information from electronic health records, and experts currently encode them manually following the International Classification of Diseases (ICD). The goal of this work is to explore how text mining can assist DT encoding. From the machine learning (ML) perspective, this is a high-dimensional classification task, as the label space comprises thousands of codes. This work investigates a robust representation of the instances to improve ML results. The proposed system finds the right ICD code among more than 1500 candidate codes with 92% precision for the main disease (primary class) and 88% for the main disease together with its nonessential modifiers (fully specified class). The methodology is simple and portable. According to experts from public hospitals, the system is particularly useful for documentation and pharmacosurveillance services; they reported an accuracy of 91.2% on a small, randomly drawn test set. We have therefore made the software publicly available alongside this paper to help the clinical and research community.
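To make the task framing concrete, the following is a minimal sketch of DT encoding as multiclass text classification. It is not the system described in the abstract, which does not specify its representation or classifier here: the character trigram representation, the nearest-neighbor rule, and the names `char_ngrams`, `NearestTermCoder`, and `precision` are all assumptions introduced for illustration. The toy terms and ICD-10 codes are illustrative only and are not drawn from the paper's data.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Represent a diagnostic term as a bag of character n-grams.

    Assumption: character trigrams stand in for the paper's
    (unspecified here) instance representation.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


class NearestTermCoder:
    """Hypothetical 1-nearest-neighbor coder over labeled diagnostic terms."""

    def __init__(self, labeled_terms):
        # Store one (vector, code) pair per manually encoded term.
        self.index = [(char_ngrams(term), code) for term, code in labeled_terms]

    def predict(self, term):
        # Return the code of the most similar stored term.
        query = char_ngrams(term)
        return max(self.index, key=lambda entry: cosine(query, entry[0]))[1]


def precision(coder, gold):
    """Fraction of terms whose predicted code matches the expert code."""
    hits = sum(coder.predict(term) == code for term, code in gold)
    return hits / len(gold)


# Toy training data (illustrative, not from the paper).
coder = NearestTermCoder([
    ("type 2 diabetes mellitus", "E11"),
    ("essential hypertension", "I10"),
    ("asthma", "J45"),
    ("acute myocardial infarction", "I21"),
])

# Unseen surface forms of the same diagnoses.
held_out = [
    ("diabetes mellitus type 2", "E11"),
    ("Essential  hypertension", "I10"),
    ("asthma attack", "J45"),
]
score = precision(coder, held_out)
```

With more than 1500 candidate codes, as in the abstract, the same interface applies, but a linear scan over every stored term becomes the bottleneck and the choice of representation matters far more than in this toy setting.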