Turner Clayton A, Jacobs Alexander D, Marques Cassios K, Oates James C, Kamen Diane L, Anderson Paul E, Obeid Jihad S
Department of Computer Science, College of Charleston, 66 George Street, Charleston, 29424, USA.
Department of Public Health Sciences, Medical University of South Carolina, 135 Cannon Street, Charleston, 29425, USA.
BMC Med Inform Decis Mak. 2017 Aug 22;17(1):126. doi: 10.1186/s12911-017-0518-1.
Identifying patients who meet specific clinical criteria through manual chart review of doctors' notes is a daunting task, given the massive volume of free-text notes in electronic health records (EHRs). This task can be automated using text classifiers that combine Natural Language Processing (NLP) techniques with pattern recognition machine learning (ML) algorithms. The aim of this research is to evaluate the performance of traditional classifiers for identifying patients with Systemic Lupus Erythematosus (SLE) in comparison with a newer Bayesian word vector method.
We obtained clinical notes for patients with SLE diagnosis along with controls from the Rheumatology Clinic (662 total patients). Sparse bag-of-words (BOWs) and Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) matrices were produced using NLP pipelines. These matrices were subjected to several different NLP classifiers: neural networks, random forests, naïve Bayes, support vector machines, and Word2Vec inversion, a Bayesian inversion method. Performance was measured by calculating accuracy and area under the Receiver Operating Characteristic (ROC) curve (AUC) of a cross-validated (CV) set and a separate testing set.
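The two ingredients named above, a sparse bag-of-words matrix and the accuracy/AUC metrics, can be illustrated with a minimal standard-library sketch. The whitespace tokenizer, the function names, and the toy notes, labels, and scores are all assumptions made for illustration; they are not the NLP pipeline or the data used in the study.

```python
from collections import Counter


def bag_of_words(notes):
    """Build a sparse bag-of-words matrix from raw note strings.

    Returns (vocab, rows): vocab maps token -> column index, and each row
    is a dict mapping column index -> token count for one note.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    vocab = {}
    rows = []
    for note in notes:
        row = {}
        for token, count in Counter(note.lower().split()).items():
            col = vocab.setdefault(token, len(vocab))
            row[col] = count
        rows.append(row)
    return vocab, rows


def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases whose thresholded score matches the 0/1 label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive case outscores a random negative case (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy inputs, not study data.
toy_notes = ["malar rash noted", "no rash"]
toy_vocab, toy_rows = bag_of_words(toy_notes)

toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two are reported side by side.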
Using ICD-9 billing codes as a baseline, we obtained an accuracy of 90.00% with an AUC of 0.900. The shallow neural network with CUIs achieved 92.10% accuracy with an AUC of 0.970, the random forest with BOWs 95.25% with an AUC of 0.994, the random forest with CUIs 95.00% with an AUC of 0.979, and the Word2Vec inversion 90.03% with an AUC of 0.905.
Our results suggest that a shallow neural network with CUIs and random forests with either CUIs or BOWs are the best classifiers for this lupus phenotyping task. The Word2Vec inversion method did not significantly outperform the ICD-9 code classification, but it yielded promising results. This method does not require explicit features and is more adaptable to non-binary classification tasks. We hypothesize that Word2Vec inversion will become more powerful with access to more data. For now, therefore, shallow neural networks and random forests are the preferred classifiers.
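The adaptability to non-binary tasks follows from the inversion step itself, which can be sketched under one assumption: that each class has its own language model that assigns a log-likelihood to a note, and Bayes' rule turns those likelihoods into class probabilities. The function name, class labels, log-likelihood values, and priors below are hypothetical placeholders, and the sketch does not train any Word2Vec model.

```python
import math


def invert_posteriors(log_likelihoods, priors):
    """Apply Bayes' rule: P(class | note) is proportional to
    P(note | class) * P(class).

    log_likelihoods: dict class -> log P(note | class), e.g. from one
                     language model per class (assumed to exist upstream).
    priors:          dict class -> P(class).
    Works for any number of classes; computed in log space for stability.
    """
    joint = {c: ll + math.log(priors[c]) for c, ll in log_likelihoods.items()}
    peak = max(joint.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
    return {c: math.exp(v - log_norm) for c, v in joint.items()}


# Hypothetical three-class example; the numbers are placeholders, not results.
example_posteriors = invert_posteriors(
    {"sle": -120.0, "control": -121.0, "other": -123.0},
    {"sle": 1 / 3, "control": 1 / 3, "other": 1 / 3},
)
predicted_class = max(example_posteriors, key=example_posteriors.get)
```

Adding a class only requires one more entry in each dictionary, with no change to the decision rule.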