Lee-St John Terrence J, Kanwar Oshin, Abidi Emna, El Nekidy Wasim, Piechowski-Jozwiak Bartlomiej
Research Department, Cleveland Clinic Abu Dhabi, Abu Dhabi, United Arab Emirates.
Pharmacy Department, Cleveland Clinic Abu Dhabi, Abu Dhabi, United Arab Emirates.
PLOS Digit Health. 2024 Sep 3;3(9):e0000589. doi: 10.1371/journal.pdig.0000589. eCollection 2024 Sep.
This manuscript presents a proof-of-concept for a generalizable strategy, termed the full algorithm, for estimating disease risk from real-world clinical tabular data systems such as electronic health records (EHR) or claims databases. By integrating classic statistical methods with modern artificial intelligence techniques, the strategy automates the production of a disease prediction model that comprehensively reflects the dynamics contained within the underlying data system. Specifically, the full algorithm parses every facet of the data (e.g., encounters, diagnoses, procedures, medications, labs, chief complaints, flowsheets, vital signs, and demographics), selects which factors to retain as predictor variables by evaluating the data empirically against statistical criteria, structures and formats the retained data into time series, trains a neural network-based prediction model, and then applies this model to current patients to generate risk estimates. A distinguishing feature of the proposed strategy is that it produces a self-adaptive prediction system: as newly collected data organically expand or modify the dataset, the prediction mechanism automatically evolves to reflect these changes. Moreover, the full algorithm operates without a priori data curation and aims to harness all informative risk and protective factors within the real-world data. This stands in contrast to traditional approaches, which often rely on highly curated datasets and domain expertise to build static prediction models based solely on well-known risk factors. As a proof-of-concept, we codified the full algorithm and tasked it with estimating the 12-month risk of initial stroke or myocardial infarction using our hospital's real-world EHR.
A 66-month pseudo-prospective validation was conducted using records from 558,105 patients spanning April 2015 to September 2023, totalling 3,424,060 patient-months. Area under the receiver operating characteristic curve (AUROC) values ranged from 0.830 to 0.909, with an improving trend over time. Odds ratios describing model precision for the patients ranked 1-100 and 101-200 by estimated risk ranged from 15.3 to 48.1 and from 7.2 to 45.0, respectively, with both groups showing improving trends over time. These findings suggest that developing high-performing disease risk calculators in the proposed manner is feasible.
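The two reported metrics can be illustrated with a minimal sketch on toy data. This is not the authors' code. The function names `auroc` and `rank_band_odds_ratio` are hypothetical, and the odds ratio definition is an assumption: the abstract does not state the comparison group, so the sketch compares the odds of the outcome among patients in a given rank band with the odds among all remaining patients.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Tied scores count as half a win (Mann-Whitney U formulation).
    Quadratic in the number of patients, so only suitable for small data."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_band_odds_ratio(
    scores: Sequence[float], labels: Sequence[int], start: int, stop: int
) -> float:
    """Odds ratio of the outcome for patients ranked start..stop
    (1-based, inclusive, highest estimated risk first) versus all other
    patients. ASSUMPTION: the reference group is everyone outside the band."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    band = set(order[start - 1:stop])
    a = sum(labels[i] for i in band)            # events inside the band
    b = len(band) - a                           # non-events inside the band
    c = sum(labels[i] for i in range(len(labels)) if i not in band)  # events outside
    d = (len(labels) - len(band)) - c           # non-events outside
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined: a zero cell in the denominator")
    return (a * d) / (b * c)


# Toy example: 8 patients, scores are estimated risks, labels are observed events.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

print(auroc(scores, labels))                       # 0.75
print(rank_band_odds_ratio(scores, labels, 1, 4))  # 9.0
```

In a pseudo-prospective setting, both functions would be evaluated once per validation month on that month's risk estimates and subsequently observed outcomes, which is how a range of values and a trend over time arise.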