Zhan Xianghao, Humbert-Droz Marie, Mukherjee Pritam, Gevaert Olivier
Department of Bioengineering, Stanford University, Stanford, CA 94305, USA.
Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA 94305, USA.
Patterns (N Y). 2021 Jun 17;2(7):100289. doi: 10.1016/j.patter.2021.100289. eCollection 2021 Jul 9.
Free-text clinical notes in electronic health records are harder to mine than structured data, while structured diagnostic codes can be missing or erroneous. To improve the quality of diagnostic codes, this work extracts them from free-text notes: five word vectorization methods, both older and more recent, were used to vectorize Stanford progress notes, and logistic regression was used to predict eight ICD-10 codes of common cardiovascular diseases. The models performed well; TF-IDF was the best vectorization method, with the highest AUROC (0.9499-0.9915) and AUPRC (0.2956-0.8072). The models also transferred to MIMIC-III data, with AUROC from 0.7952 to 0.9790 and AUPRC from 0.2353 to 0.8084. Model interpretability was demonstrated by the most important words, whose clinical meanings matched each disease. This study shows the feasibility of accurately extracting structured diagnostic codes, imputing missing codes, and correcting erroneous codes from free-text clinical notes for information retrieval and downstream machine-learning applications.
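The pipeline summarized in the abstract (TF-IDF vectorization, one logistic regression per diagnostic code, AUROC/AUPRC evaluation, inspection of high-weight words) can be illustrated with a minimal sketch. This is not the authors' implementation: the notes, the single binary label, the smoothed IDF formula, and all hyperparameters (`epochs`, `lr`, `l2`) are assumptions chosen only for illustration, and average precision is used here as a common estimate of AUPRC.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1 (assumed variant).
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(tokenize(d)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    # Sparse, L2-normalized TF-IDF vector; words unseen in training are dropped.
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def predict(x, w, b):
    z = b + sum(w.get(t, 0.0) * v for t, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=300, lr=0.5, l2=1e-3):
    # Plain stochastic gradient descent on the logistic loss with a small L2 penalty.
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            g = predict(x, w, b) - label
            for t, v in x.items():
                w[t] = w.get(t, 0.0) - lr * (g * v + l2 * w.get(t, 0.0))
            b -= lr * g
    return w, b

def auroc(scores, labels):
    # Probability that a random positive outranks a random negative (ties count 0.5).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    # Mean of precision values at the rank of each positive example.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical toy notes; label 1 means the (illustrative) code applies.
train_notes = [
    "patient with atrial fibrillation on anticoagulation",
    "irregular rhythm atrial fibrillation rate controlled",
    "history of atrial fibrillation continue warfarin",
    "patient with knee pain after fall",
    "cough and fever likely viral infection",
    "follow up for diabetes diet counseling",
]
y_train = [1, 1, 1, 0, 0, 0]
test_notes = [
    "atrial fibrillation noted on telemetry",
    "knee pain improved with rest",
    "new atrial fibrillation start anticoagulation",
    "fever and cough resolved",
]
y_test = [1, 0, 1, 0]

idf = fit_idf(train_notes)
w, b = train_logreg([tfidf(d, idf) for d in train_notes], y_train)
scores = [predict(tfidf(d, idf), w, b) for d in test_notes]

print("AUROC:", auroc(scores, y_test))
print("AUPRC (average precision):", average_precision(scores, y_test))
# Words with the largest positive weights act as a simple interpretability check.
print("Top words:", sorted(w, key=w.get, reverse=True)[:3])
```

In a multi-code setting like the one described, one such binary classifier would be trained per ICD-10 code, and the highest-weight words for each classifier could be reviewed for clinical plausibility.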