Collyer Taya A, Liu Ming, Beare Richard, Andrew Nadine E, Ung David, Carver Alison, Ilomaki Jenni, Bell J Simon, Thrift Amanda G, Rocca Walter A, St Sauver Jennifer L, Lu Alicia, Siostrom Kristy, Moran Chris, Roberts Helene, Chong Trevor T-J, Murray Anne, Ravipati Tanya, O'Bree Bridget, Srikanth Velandai K
National Centre for Healthy Ageing, Frankston, Victoria, Australia.
Peninsula Clinical School, School of Translational Medicine, Monash University, Frankston, Victoria, Australia.
Alzheimers Dement. 2025 May;21(5):e70132. doi: 10.1002/alz.70132.
Identifying individuals with dementia is crucial for prevalence estimation and service planning, but reliable, scalable methods are lacking. We developed a novel set of algorithms using both structured and unstructured electronic health record (EHR) data, applying Diagnostic and Statistical Manual of Mental Disorders criteria for dementia case identification.
Our cohort (n = 1082) included individuals aged ≥ 60 years with dementia identified through specialist clinics and a comparison group without dementia. Clinicians from Australia and the United States informed predictor selection. We developed algorithms through a biostatistics stream for structured data and a natural language processing (NLP) stream for text, synthesizing results via logistic regression.
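The synthesis step can be illustrated with a minimal sketch, assuming each stream produces one probability-like score per patient and a second-stage logistic regression combines the two scores. The data, the function names (`fit_logistic`, `predict_proba`), the learning rate, and the iteration count are all hypothetical and are not taken from the study.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, n_iter=5000):
    """Fit logistic regression by batch gradient descent.

    Returns [bias, w_1, ..., w_d]. Hypothetical helper for illustration only.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(n_iter):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        # Step against the mean gradient of the log-loss.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, x):
    # Predicted probability of dementia for one patient's stream scores.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Made-up stream outputs: (structured_score, nlp_score) per patient.
stream_scores = [
    (0.9, 0.8), (0.7, 0.9), (0.6, 0.4), (0.8, 0.3),  # dementia
    (0.2, 0.1), (0.3, 0.6), (0.4, 0.2), (0.1, 0.3),  # no dementia
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights = fit_logistic(stream_scores, labels)
combined = [predict_proba(weights, s) for s in stream_scores]
```

This sketch fits and scores on the same toy data purely to show the mechanics. In practice the second-stage model would be fitted on out-of-fold or held-out stream predictions to avoid optimistic estimates.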
The final structured model retained 16 variables (area under the receiver operating characteristic curve [AUC] 0.853, specificity 72.2%, sensitivity 80.6%). NLP classifiers (logistic regression, support vector machine, and random forest models) performed comparably. The final, combined model outperformed all others (AUC = 0.951, P < 0.001 for comparison to structured model).
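For readers less familiar with the reported metrics, the following minimal sketch shows how sensitivity, specificity, and AUC can be computed from predicted scores. The labels, scores, and the 0.5 threshold are made up for illustration, and the AUC is computed with the pairwise (Mann-Whitney) formulation, which may differ from the authors' software.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Return (sensitivity, specificity) when scores >= threshold count as positive."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, y_score):
    """AUC as the probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up example: 4 people with dementia (1) and 4 without (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

sens, spec = sensitivity_specificity(y_true, y_score, threshold=0.5)
area = auc(y_true, y_score)
print(sens, spec, area)  # 0.75 0.75 0.875
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes discrimination across all thresholds, which is why the abstract reports both.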
Embedding text-derived insights within algorithms trained on structured medical data significantly enhances dementia identification capacity.
Algorithmic tools for detection of individuals with dementia are available; however, previous work has used heterogeneous case definitions that are not clinically meaningful, and has relied on proxies such as diagnostic codes or medications for case ascertainment. We used a novel, dual-stream algorithmic development approach, simultaneously and separately modeling a clinically meaningful outcome (diagnosis of dementia according to specialized clinical impression) using structured and unstructured electronic health record datasets. Our clinically grounded case definition supported the inclusion of key structured variables (such as dementia International Classification of Diseases codes and medications) as modeling predictors rather than outcomes. Our algorithms, published in detail to support validation and replication, represent a major step forward in the use of routinely collected data for detection of diagnosed dementia.