Sharma Suchetha, Liu Jiebei, Abramowitz Amy Caroline, Geary Carol Reynolds, Johnston Karen C, Manning Carol, Van Horn John Darrell, Zhou Andrea, Anzalone Alfred J, Loomba Johanna, Pfaff Emily, Brown Don
School of Data Science, University of Virginia, Charlottesville, VA 22903, United States.
Department of Systems Engineering, University of Virginia, Charlottesville, VA 22904, United States.
JAMIA Open. 2024 Aug 6;7(3):ooae076. doi: 10.1093/jamiaopen/ooae076. eCollection 2024 Oct.
To provide a foundational methodology for differentiating comorbidity patterns in subphenotypes through investigation of a multi-site dementia patient dataset.
Employing the National Clinical Cohort Collaborative Tenant Pilot (N3C Clinical) dataset, our approach integrates two machine learning algorithms, logistic regression and eXtreme Gradient Boosting (XGBoost), with a diagnostic hierarchical model to classify dementia subtypes based on comorbidities and gender. The methodology draws on multi-site EHR data and addresses class imbalance with a hybrid sampling strategy that combines 65% Synthetic Minority Over-sampling Technique (SMOTE), 35% Random Under-Sampling (RUS), and Tomek Links. The hierarchical model further refines the analysis, allowing for a layered understanding of disease patterns.
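The hybrid sampling step can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation. The abstract does not define what the 65% and 35% refer to; the sketch assumes, purely for illustration, that 65% of the gap between the majority and minority class counts is closed by synthetic minority samples and the remaining 35% by randomly dropping majority samples, after which the majority member of each Tomek link is removed. The function names (`smote`, `tomek_links`, `hybrid_resample`), the binary-label setting, and the parameter `k=5` are likewise hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, rng=random):
    """Create n_new synthetic points, each interpolated between a random
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def tomek_links(X, y):
    """Return index pairs (i, j) of opposite-class points that are
    each other's nearest neighbour."""
    def nearest(i):
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )
    nn = [nearest(i) for i in range(len(X))]
    return [
        (i, j) for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


def hybrid_resample(X, y, minority_label, smote_share=0.65, seed=0):
    """Binary-class hybrid resampling under the ASSUMED reading of 65%/35%:
    smote_share of the class gap is filled with synthetic minority points,
    the rest is removed from the majority class, then Tomek links are cleaned."""
    rng = random.Random(seed)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    majority = [(x, lab) for x, lab in zip(X, y) if lab != minority_label]

    gap = len(majority) - len(minority)  # assumes the minority class is smaller
    n_new = round(smote_share * gap)     # points added by SMOTE
    n_drop = gap - n_new                 # points removed by random under-sampling

    synthetic = smote(minority, n_new, rng=rng)
    kept_majority = rng.sample(majority, len(majority) - n_drop)

    X_res = minority + synthetic + [x for x, _ in kept_majority]
    y_res = [minority_label] * (len(minority) + n_new) + [lab for _, lab in kept_majority]

    # Tomek-link cleaning: drop the majority member of every link.
    drop = set()
    for i, j in tomek_links(X_res, y_res):
        drop.add(i if y_res[i] != minority_label else j)
    keep = [i for i in range(len(X_res)) if i not in drop]
    return [X_res[i] for i in keep], [y_res[i] for i in keep]
```

Calling `hybrid_resample(X, y, minority_label=1)` on a list of feature tuples `X` and labels `y` returns the resampled features and labels. In practice a maintained library implementation of SMOTE, random under-sampling, and Tomek links would normally be preferred, and resampling should be applied to training folds only so that evaluation data keep their original class distribution.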
The study identified significant comorbidity patterns associated with diagnosis of Alzheimer's, Vascular, and Lewy Body dementia subtypes. The classification models achieved accuracies up to 69% for Alzheimer's/Vascular dementia and highlighted challenges in distinguishing Dementia with Lewy Bodies. The hierarchical model elucidates the complexity of diagnosing Dementia with Lewy Bodies and reveals the potential impact of regional clinical practices on dementia classification.
Our methodology underscores the importance of leveraging multi-site datasets and tailored sampling techniques for dementia research. This framework holds promise for extending to other disease subtypes, offering a pathway to more nuanced and generalizable insights into dementia and its complex interplay with comorbid conditions.
This study underscores the critical role of multi-site data analyses in understanding the relationship between comorbidities and disease subtypes. By utilizing diverse healthcare data, we emphasize the need to consider site-specific differences in clinical practices and patient demographics. Despite challenges like class imbalance and variability in EHR data, our findings highlight the essential contribution of multi-site data to developing accurate and generalizable models for disease classification.