Zhao Sizheng Steven, Hong Chuan, Cai Tianrun, Xu Chang, Huang Jie, Ermann Joerg, Goodson Nicola J, Solomon Daniel H, Cai Tianxi, Liao Katherine P
Institute of Ageing and Chronic Disease, University of Liverpool.
Department of Academic Rheumatology, Aintree University Hospital, Liverpool, UK.
Rheumatology (Oxford). 2020 May 1;59(5):1059-1065. doi: 10.1093/rheumatology/kez375.
To develop classification algorithms that accurately identify axial spondyloarthritis (axSpA) patients in electronic health records, and to compare the performance of algorithms incorporating free-text data against approaches using only International Classification of Diseases (ICD) codes.
An enriched cohort of 7853 eligible patients was created from the electronic health records of two large hospitals using automated searches (⩾1 ICD code combined with simple text searches). Key disease concepts were extracted from free-text data using natural language processing (NLP) and combined with ICD codes to develop algorithms. We created both supervised regression-based algorithms (trained on a set of 127 axSpA cases and 423 non-cases) and unsupervised algorithms to identify patients with a high probability of having axSpA in the enriched cohort. Their performance was compared against classifications using ICD codes only.
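As a rough illustration of the supervised, regression-based approach, the sketch below fits a plain logistic regression to two features per patient: a count of ICD codes and a count of NLP concept mentions. The abstract does not specify the regression model, the feature transformation or the fitting procedure, so the `log(1 + count)` features, the function names and the toy training rows are all assumptions made for this example only.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def features(icd_count, nlp_count):
    # Assumed transformation: log(1 + count) dampens very large counts.
    return [math.log1p(icd_count), math.log1p(nlp_count)]

def fit_logistic(X, y, lr=0.1, epochs=2000):
    # Batch gradient descent on the mean logistic loss (no regularization).
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    # Estimated probability that a patient is a case.
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Hypothetical labeled rows: (ICD code count, NLP concept count, is_case).
train = [
    (5, 12, 1), (3, 8, 1), (4, 6, 1), (2, 9, 1),
    (1, 0, 0), (0, 1, 0), (1, 1, 0), (2, 0, 0),
]
X = [features(icd, nlp) for icd, nlp, _ in train]
y = [label for _, _, label in train]
w, b = fit_logistic(X, y)

p_likely_case = predict_proba(w, b, features(4, 10))
p_likely_non_case = predict_proba(w, b, features(1, 0))
```

In practice a penalized regression from an established statistics library would be a more natural choice than hand-written gradient descent; the loop is spelled out here only to keep the example self-contained.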
NLP extracted four disease concepts of high predictive value: ankylosing spondylitis (AS), sacroiliitis, HLA-B27 and spondylitis. The unsupervised algorithm, incorporating both the NLP concept and the ICD code for AS, identified the greatest number of patients. With the probability threshold set to attain 80% positive predictive value, it identified 1509 axSpA patients (mean age 53 years, 71% male). Sensitivity was 0.78, specificity 0.94 and area under the curve 0.93. The two supervised algorithms performed similarly but identified fewer patients. All three outperformed traditional approaches using ICD codes alone (area under the curve 0.80 to 0.87).
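Setting a probability threshold to attain a target positive predictive value (PPV) can be sketched as follows. This is a minimal illustration on invented scores and labels, not the study's procedure: the function names and the rule of taking the lowest threshold that meets the target (so that as many patients as possible are identified) are assumptions.

```python
def confusion(scores, labels, threshold):
    # Count true/false positives and negatives when score >= threshold is "case".
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted_case = s >= threshold
        if predicted_case and y:
            tp += 1
        elif predicted_case and not y:
            fp += 1
        elif not predicted_case and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(scores, labels, threshold):
    tp, fp, tn, fn = confusion(scores, labels, threshold)
    nan = float("nan")
    return {
        "ppv": tp / (tp + fp) if tp + fp else nan,
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
    }

def lowest_threshold_for_ppv(scores, labels, target_ppv=0.80):
    # Scan candidate thresholds in ascending order and return the first one
    # whose PPV meets the target, or None if no threshold does.
    for t in sorted(set(scores)):
        tp, fp, _, _ = confusion(scores, labels, t)
        if tp + fp and tp / (tp + fp) >= target_ppv:
            return t
    return None

# Hypothetical predicted probabilities and chart-review labels (1 = case).
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

threshold = lowest_threshold_for_ppv(scores, labels, target_ppv=0.80)
result = metrics(scores, labels, threshold)
```

On this toy data the chosen threshold is 0.40, where 5 of the 6 patients flagged are true cases (PPV about 0.83), sensitivity is 1.0 and specificity is 0.8. PPV is not guaranteed to rise monotonically with the threshold, which is why the sketch scans every candidate value rather than using a bisection search.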
Algorithms incorporating free-text data can accurately identify axSpA patients in electronic health records. Large cohorts identified using these novel methods offer exciting opportunities for future clinical research.