Garg Ravi, Oh Elissa, Naidech Andrew, Kording Konrad, Prabhakaran Shyam
Department of Neurology, Northwestern University, Feinberg School of Medicine, Chicago, Illinois.
University of Pennsylvania, Philadelphia, Pennsylvania.
J Stroke Cerebrovasc Dis. 2019 Jul;28(7):2045-2051. doi: 10.1016/j.jstrokecerebrovasdis.2019.02.004. Epub 2019 May 15.
Manual adjudication of disease classification is time-consuming and error-prone, and it limits scaling to large datasets. In ischemic stroke (IS), subtype classification is critical for management and outcome prediction. This study sought to use natural language processing of electronic health records (EHR) combined with machine learning methods to automate IS subtyping.
Among IS patients from an observational registry with TOAST subtyping adjudicated by board-certified vascular neurologists, we used natural language processing to analyze unstructured text-based EHR data, including neurology progress notes and neuroradiology reports. We applied several feature selection methods to reduce the high dimensionality of the features and used 5-fold cross-validation to test the generalizability of our methods and minimize overfitting. We trained several machine learning methods and calculated kappa values for agreement between each machine learning approach and manual adjudication. We then performed blinded testing of the best algorithm on a held-out subset of 50 cases.
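The two evaluation steps described above, k-fold splitting and chance-corrected agreement, can be illustrated with a minimal sketch. This is not the study's code: the function names `cohen_kappa` and `k_fold_indices` are hypothetical, the folds are simple contiguous blocks with no shuffling or stratification, and the abstract does not state which kappa variant or splitting scheme the authors used. Unweighted Cohen's kappa is assumed here.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels.

    Assumption: the abstract reports "kappa" without naming a variant;
    this is the standard two-rater, unweighted form.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds.

    Hypothetical helper: no shuffling or stratification is applied.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size
```

In use, a classifier would be fit on each `train_idx`, its predictions on `test_idx` collected across the five folds, and `cohen_kappa(manual_labels, predicted_labels)` computed on the pooled predictions. Whether the authors pooled predictions or averaged per-fold kappa values is not stated in the abstract.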
Compared with manual classification, the best machine-based classification achieved a kappa of .25 using radiology reports alone, .57 using progress notes alone, and .57 using combined data. Kappa values varied by subtype and were highest for cardioembolic (.64) and lowest for cryptogenic cases (.47). In the held-out test subset, machine-based classification agreed with rater classification in 40 of 50 cases (kappa .72).
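The held-out result pairs a raw agreement rate (40 of 50) with a kappa of .72, and the relationship between the two can be worked through with a short calculation. The sketch below assumes unweighted Cohen's kappa, defined as (p_o - p_e) / (1 - p_e), and solves it for the chance agreement p_e; the function name `implied_chance_agreement` is hypothetical, and the result is only approximate because the reported kappa is rounded.

```python
def implied_chance_agreement(p_observed, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for the chance agreement p_e."""
    if kappa >= 1.0:
        raise ValueError("kappa must be below 1 to solve for chance agreement")
    return (p_observed - kappa) / (1 - kappa)


# Figures reported for the held-out subset: 40 of 50 cases agreed, kappa .72.
p_observed = 40 / 50
p_expected = implied_chance_agreement(p_observed, 0.72)
```

Under these assumptions, an observed agreement of 0.80 with a kappa of .72 implies a chance agreement of roughly 0.29, which shows how much of the raw 80% agreement the kappa statistic discounts.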
Automated machine learning approaches using textual data from the EHR show agreement with manual TOAST classification. The automated pipeline, if externally validated, could enable large-scale stroke epidemiology research.