Zhou Shang-Ming, Fernandez-Gutierrez Fabiola, Kennedy Jonathan, Cooksey Roxanne, Atkinson Mark, Denaxas Spiros, Siebert Stefan, Dixon William G, O'Neill Terence W, Choy Ernest, Sudlow Cathie, Brophy Sinead
Institute of Life Science, College of Medicine, Swansea University, Swansea, United Kingdom.
UCL Institute of Health Informatics and Farr Institute of Health Informatics Research, London, United Kingdom.
PLoS One. 2016 May 2;11(5):e0154515. doi: 10.1371/journal.pone.0154515. eCollection 2016.
This study linked routine primary and secondary care EHRs in Wales, UK. A machine learning based scheme was used to identify patients with rheumatoid arthritis (RA) from primary care EHRs via the following steps: i) selection of variables by comparing the relative frequencies of Read codes in the primary care dataset between disease cases and non-disease controls (disease status being based on the secondary care diagnosis); ii) reduction of predictors/associated variables using a Random Forest method; iii) induction of decision rules from a decision tree model. The proposed method was then extensively validated on an independent dataset, and its performance was compared with that of two existing deterministic algorithms for RA, which had been developed using expert clinical knowledge.
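Step i) above can be illustrated with a minimal sketch. It assumes each patient is represented as a set of clinical codes and that a code is kept when its relative frequency among cases is at least `min_ratio` times that among controls. The function name, the `min_ratio` threshold, and the code labels (`RA_DX`, `DMARD`, `COUGH`) are hypothetical placeholders, not real Read codes and not the study's actual selection criterion, which the abstract does not specify.

```python
from collections import Counter


def select_candidate_codes(case_records, control_records, min_ratio=2.0):
    """Keep codes that are relatively more frequent in cases than in controls.

    case_records / control_records: lists of sets of codes, one set per patient.
    min_ratio: hypothetical threshold; the source does not state one.
    Returns a dict mapping each selected code to its case/control frequency ratio.
    """
    def relative_freq(records):
        # Fraction of patients whose record contains each code at least once.
        if not records:
            return {}
        counts = Counter(code for codes in records for code in set(codes))
        return {code: n / len(records) for code, n in counts.items()}

    case_freq = relative_freq(case_records)
    control_freq = relative_freq(control_records)

    selected = {}
    for code, f_case in case_freq.items():
        f_ctrl = control_freq.get(code, 0.0)
        # A code never seen in controls has an unbounded ratio; keep it.
        ratio = float("inf") if f_ctrl == 0 else f_case / f_ctrl
        if ratio >= min_ratio:
            selected[code] = ratio
    return selected


# Toy data with placeholder code labels (not real Read codes).
cases = [{"RA_DX", "DMARD"}, {"RA_DX"}, {"DMARD", "COUGH"}, {"RA_DX", "COUGH"}]
controls = [{"COUGH"}, {"COUGH", "DMARD"}, set(), {"COUGH"}]
selected = select_candidate_codes(cases, controls)
```

In this toy data, `RA_DX` and `DMARD` are retained while `COUGH`, which is more common among controls, is dropped. Steps ii) and iii) would then operate on the retained codes only.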
Primary care EHRs were available for 2,238,360 patients over the age of 16, of whom 20,667 were also linked in the secondary care rheumatology clinical system. In the linked dataset, 900 predictors (out of a total of 43,100 variables) in the primary care record were found more frequently in patients with RA than in those without. These variables were reduced to 37 groups of related clinical codes, which were used to develop a decision tree model. The final algorithm identified 8 predictors related to diagnostic codes for RA, medication codes (such as those for disease modifying anti-rheumatic drugs), and the absence of alternative diagnoses such as psoriatic arthritis. The proposed data-driven method performed as well as the expert clinical knowledge based methods.
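To make the shape of such a decision rule concrete, the following toy sketch combines the three kinds of predictor named above: a diagnostic code, a medication code, and the absence of an alternative diagnosis. It is purely illustrative and is not the published algorithm. The rule logic, the function name, and the code groups (`RA_DX`, `DMARD_RX`, `PSA_DX`) are all hypothetical placeholders, since the abstract does not give the 8 predictors or how they are combined.

```python
def classify_ra(patient_codes, ra_dx_codes, dmard_codes, alternative_dx_codes):
    """Toy rule: flag RA when a diagnostic code and a DMARD code are both
    present and no alternative diagnosis code is recorded.

    All code groups are caller-supplied sets; the rule itself is an
    illustrative assumption, not the rule induced in the study.
    """
    codes = set(patient_codes)
    has_ra_dx = bool(codes & ra_dx_codes)
    has_dmard = bool(codes & dmard_codes)
    has_alternative = bool(codes & alternative_dx_codes)
    return has_ra_dx and has_dmard and not has_alternative


# Placeholder code groups (not real Read codes).
RA_DX = {"RA_DX"}
DMARD_RX = {"DMARD_RX"}
PSA_DX = {"PSA_DX"}

flagged = classify_ra({"RA_DX", "DMARD_RX"}, RA_DX, DMARD_RX, PSA_DX)
excluded = classify_ra({"RA_DX", "DMARD_RX", "PSA_DX"}, RA_DX, DMARD_RX, PSA_DX)
```

A rule induced from a decision tree would generally be a set of such conjunctions, one per path from the root to a leaf, rather than a single condition.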
Data-driven schemes, such as ensemble machine learning methods, have the potential to identify the most informative predictors in a cost-effective and rapid way, and so to classify rheumatoid arthritis or other complex medical conditions accurately and reliably in primary care EHRs.
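1) To use data-driven methods to examine clinical codes (risk factors) for a medical condition in primary care electronic health records (EHRs) that can accurately predict a diagnosis of the condition in secondary care EHRs. 2) To develop and validate a disease phenotyping algorithm for rheumatoid arthritis using primary care EHRs.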